
AMID: Approximation of MultI-measured Data using SVD

Jun-Ki Min a, Chun-Hee Lee b, Chin-Wan Chung b,*

a School of Internet-Media Engineering, Korea University of Technology and Education, Byeongcheon-myeon, Cheonan, Chungnam 330-708, Republic of Korea
b Division of Computer Science, Department of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Daejon 305-701, Republic of Korea

Article info

Article history:
Received 13 April 2008
Received in revised form 6 April 2009
Accepted 8 April 2009

Keywords:
Approximation
Multi-measured data
SVD
Wavelet
Eckart–Young theorem
Incremental update

0020-0255/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2009.04.008

* Corresponding author. E-mail addresses: [email protected] (J.-K. Min), [email protected] (C.-H. Lee), [email protected], [email protected] (C.-W. Chung).

Abstract

Approximate query answering has recently emerged as an effective method for generating a viable answer. Among various techniques for approximate query answering, wavelets have received a lot of attention. However, wavelet techniques minimizing the root squared error (i.e., the $L_2$ norm error) have several problems such as the poor quality of reconstructed data when the original data is biased. In this paper, we present AMID (Approximation of MultI-measured Data using SVD) for multi-measured data. In AMID, we adapt the singular value decomposition (SVD) to compress multi-measured data. We show that SVD guarantees the root squared error, and also derive an error bound of SVD for an individual data value, using mathematical analyses. In addition, in order to improve the accuracy of approximated data, we combine SVD and wavelets in AMID.

Since SVD is applied to a fixed matrix, we use various properties of matrices to adapt SVD to the incremental update environment. We devise two variants of AMID for the incremental update environment: incremental AMID and local AMID. To the best of our knowledge, our work is the first to extend SVD to incremental update environments.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

In general, traditional database management systems (DBMSs) generate exact query results with respect to user requests. However, due to the explosive growth of networking in recent years, a large volume of data is transmitted into a system through the Internet continuously. In this situation, to generate exact query results, a system may have to scan an enormous amount of data and waste valuable resources (e.g., time, disk space, and computing power).

In particular, time-critical applications such as decision support systems (DSSs) require a fast response in order to provide viable information to users. Due to the exploratory nature of many DSS applications, an exact result may not be required, while a user prefers a fast answer. Thus, approximate query answering has recently emerged as an effective method for generating viable answers to complex queries against a large volume of data. In recent years, in order to facilitate approximate query answering, much research on random sampling [10,9,16,28], histograms [17,18,24], and wavelets [2,6–8,13,19,23] has been conducted.

Random sampling and histograms have a long and rich history in the query optimization area. In order to estimate the accurate result size of queries, statistics of the data distribution are required. The most accurate statistics of the data is the data itself. However, result size estimation using the data itself is impractical. Thus, in random sampling, a small data set is selected based on a probability model in order to represent the statistics. In a histogram, the data distribution is represented by buckets in which summary data (e.g., the frequency of data values) is maintained.



Fig. 1. The computation model of our work. (The figure shows a data stream of multi-measured data feeding the current analysis; periodic snapshots are stored approximately in a data warehouse, which supports historical analysis through approximate query answers.)


Matias et al. [23] proposed a histogram method using wavelets. After their work, wavelet-based techniques for query optimization and approximate query answering have received significant attention. Wavelets are a mathematical tool for a hierarchical decomposition of functions. By storing a few wavelet coefficients, the data can be stored in a small disk space with a little loss of accuracy, and an approximate query result can be obtained efficiently.

In DSS applications, data generally consists of multiple measures. For example, a stock market database includes information on the corporation number, the trade amount, the upper price, the lower price, and so on. These applications gather multi-measured data transmitted continuously and analyze them on-line. Also, for historical analyses, these applications may generate a snapshot periodically (e.g., daily or weekly).

The computation model of our work is shown in Fig. 1. For current analyses, data is kept in memory for a specified period of time. Collected data (a snapshot) is stored approximately in a data warehouse. Approximated data in the data warehouse is used for historical analyses.

In order to support approximate query answering in multiple measure environments using wavelets, Deligiannakis and Roussopoulos [3] presented extended wavelets. As mentioned in [3], wavelets cannot easily adapt to multi-measured data. In order to reduce the disk space and minimize the root squared error, the extended wavelet records multiple wavelet coefficients for different measures. In order to improve the time and space complexity for generating the extended wavelet, Guha et al. [14] suggested XWAVE, which is based on a dynamic programming formulation minimizing the $L_2$ norm error efficiently.

However, recent works have shown that the wavelet techniques based on minimizing the $L_2$ norm error can suffer from important problems such as the severe bias and wide variance in the quality of reconstructed data, and the lack of an error bound for an individual approximate answer [6,8]. Actually, the $L_2$ norm error is greater than or equal to the maximum error (i.e., $L_\infty$) for an individual approximate answer. However, it does not provide a tight bound of the maximum error.

1.1. Our contribution

In this paper, we propose a data approximation method for solving the problems of the wavelet techniques mentioned above. We take a different approach, called AMID (Approximation of MultI-measured Data using SVD), for multi-measured environments. AMID utilizes singular value decomposition (SVD) [11,22], which has been employed for diverse image applications such as compression and feature extraction. Also, for historical analyses in DSS systems, we propose incremental update methods based on SVD.

Multi-measured data is treated as a two-dimensional matrix. SVD of this matrix provides a medium to extract dominant vectors effectively. Using the extracted dominant vectors, the original matrix can be represented approximately.

SVD is a numerical tool, which effectively decomposes a matrix into two orthogonal matrices and its singular values. Thus a matrix $A$ is decomposed into $A = U\Sigma V^T$,¹ where $A$ is an $m \times n$ matrix that we want to summarize, $\Sigma$ is an $n \times n$ diagonal matrix, $U$ is an $m \times n$ column-orthogonal matrix,² and $V$ is an $n \times n$ orthogonal matrix.³

The contributions of this paper are as follows:

• Guarantee maximum absolute error for an individual data value: The wavelet techniques based on minimizing the $L_2$ norm error do not suggest a tight error bound for an individual data value. SVD guarantees the $L_2$ norm error. In addition, in this paper, based on a mathematical analysis of SVD, we derive the error bound for each data value.

¹ $V^T$ is the transpose of $V$.
² $M$ is a column-orthogonal matrix if $M^TM = I$.
³ $M$ is an orthogonal matrix if $M^TM = I$ and $MM^T = I$.


• Combine SVD and wavelets for multi-measured environments: Although SVD presents an effective mechanism for the approximation of data, we apply wavelets after applying SVD in order to improve the accuracy of the approximation. We show in the experiments that combining SVD and wavelets achieves a lower error ratio compared to the utilization of only SVD.

• Adapt to incremental update environments: To the best of our knowledge, existing SVD techniques consider only a preexisting, fixed matrix. In incremental update environments, the current snapshot, which is generated for historical analysis, should be consolidated with the previously archived data. A naive approach is to reconstruct the previous compressed data and compress the whole data, including the reconstructed data and the current snapshot, using SVD. This approach wastes computing power and memory. Thus, we devise efficient SVD algorithms to reflect the current snapshot in the compressed data without the whole reconstruction.

In addition, to demonstrate the effectiveness of AMID, we implemented various versions of AMID. We conducted an extensive experimental study with real-life and synthetic data sets. Our experiments show that AMID achieves an improvement of accuracy compared to other approaches.

Remarks on the originality: In this paper, to improve the accuracy of approximation for multi-measured environments, we propose a novel method to combine SVD and wavelets and provide the error bound for an individual data value. Also, we adapt SVD to incremental update environments. To the best of our knowledge, applying SVD to incremental update environments was not considered previously.

1.2. Organization

The remainder of the paper is organized as follows. In Section 2, we present previous work. We describe the basics of SVD and wavelets in Section 3. In Section 4, we present the details of AMID and the error bound of SVD. Section 5 presents an extension of AMID for incremental update environments. Section 6 contains the results of our experiments. Finally, in Section 7, we summarize our work.

2. Previous work

In order to support efficient approximate query processing, various techniques which represent a huge amount of data with small disk space have been proposed. The representative techniques among them are sampling [1,10,9,16,28], histograms [17,18,24], and wavelets [2,6–8,13,19,23].

The basic idea of sampling is that a small number of samples of the data represents the data well. In [28], the reservoir sampling algorithm was presented, which can be used to create and maintain a set of random samples of a fixed size with a low overhead. In [16], a probabilistic guarantee of sampling was presented based on Hoeffding's inequality. Gibbons and Matias [9] introduced concise samples and counting samples, in which duplicated samples are represented as ⟨value, count⟩. To estimate the number of distinct values, distinct sampling was proposed [10]. However, the sampling technique must take enough samples to achieve the desired accuracy. Moreover, the sampling technique does not suggest accurate answers for some aggregate functions such as minimum and maximum.

The histogram has been widely used for selectivity estimation in query optimization [5]. The histogram approximates the frequency distribution of element values. It partitions the data distribution into a small set of intervals called buckets to approximate the data distribution and keeps some statistics such as the frequency in each bucket.

The histogram methods are classified into various methods with respect to the partitioning policy of the data. An intuitive method is the equi-width histogram. In the equi-width histogram, the lengths of all intervals (i.e., buckets) are equal and the statistic value of each bucket denotes the number of data items appearing in the interval. In the equi-depth histogram [24], each bucket has the same number of data items while the widths of the buckets are different. In the V-Optimal histogram [17], the sum of the weighted variances of the buckets is minimized. The V-Optimal histogram has been shown to be the most accurate histogram [17,18].

Among various wavelet transformations, the Haar wavelet is utilized in diverse techniques due to its simplicity and efficiency. The Haar wavelet converts a sequence of values into wavelet coefficients. To compact the information, wavelet thresholding is applied. Matias et al. [23] proposed a wavelet-based histogram technique in which a lot of small-sized buckets are compressed. In their experiments, Matias et al. [23] showed that the wavelet techniques were more accurate than sampling and general histogram techniques. After their work, wavelet-based techniques for query optimization and approximate query processing have received significant attention.

In order to approximate multi-measured data effectively, extended wavelets have been proposed [3]. A coefficient of extended wavelets for $M$ measures consists of ⟨I, B, V⟩, where $I$ denotes the location of an extended wavelet coefficient, $B$ is an $M$-bit bitmap indicating whether a Haar wavelet coefficient for a measure is recorded in $V$, the list of Haar wavelet coefficients. The work of Guha et al. [14] proposed XWAVE, which improves the time and space complexity. In [15], Guha improved the space complexity of extended wavelets. These techniques, related to extended wavelets, commonly minimize the $L_2$ norm error. It is known that the square of the root mean squared error is equal to the sum of the squares of the coefficients which are dropped. Based on this fact, small coefficients are removed by the wavelet thresholding procedure in order to minimize the $L_2$ norm error.
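As an illustration of this layout, the following sketch (Python; the class and field names are ours, not from [3]) shows how a single ⟨I, B, V⟩ triple might be represented and queried:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ExtendedWaveletCoefficient:
    """Illustrative sketch of the <I, B, V> triple described above."""
    index: int           # I: location of the extended wavelet coefficient
    bitmap: int          # B: M-bit bitmap; bit j set => a coefficient for measure j is stored
    values: List[float]  # V: the stored Haar coefficients, one per set bit

    def coefficient_for(self, measure: int) -> Optional[float]:
        """Return the stored Haar coefficient for a measure, or None if it was dropped."""
        if not (self.bitmap >> measure) & 1:
            return None
        # Its position in `values` equals the number of set bits below `measure`.
        rank = bin(self.bitmap & ((1 << measure) - 1)).count("1")
        return self.values[rank]
```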


Fig. 2. An example of wavelet transformation.


As reported in [6], the reconstructed values using the wavelet techniques minimizing the $L_2$ norm error are quite different from the original values when the original values are biased in some regions. For example, the wavelet coefficients 48, 45, 89, −1 are obtained⁴ from the original values 182, 4, 2, 4. Assume that only the two largest coefficients, 48 and 89, remain after thresholding. The reconstructed values using the retained coefficients are 137, −41, 48, 48. The reconstructed values are quite different from the original values. In addition, the wavelet techniques minimizing the $L_2$ norm error do not guarantee the error bound of an individual data value.

⁴ For simplicity, the normalization is not applied.
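The arithmetic of this example is easy to reproduce; a small sketch (Python, using the unnormalized Haar transform as in the footnote):

```python
def haar(data):
    """Unnormalized Haar transform: [overall average, coarse details, ..., fine details]."""
    details = []
    while len(data) > 1:
        details = [(a - b) / 2 for a, b in zip(data[::2], data[1::2])] + details
        data = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
    return data + details

def inverse_haar(coeffs):
    data, pos = [coeffs[0]], 1
    while pos < len(coeffs):
        ds = coeffs[pos:pos + len(data)]
        data = [v for a, d in zip(data, ds) for v in (a + d, a - d)]
        pos += len(ds)
    return data

c = haar([182, 4, 2, 4])                                 # [48.0, 45.0, 89.0, -1.0]
top2 = sorted(range(4), key=lambda i: abs(c[i]), reverse=True)[:2]
kept = [v if i in top2 else 0 for i, v in enumerate(c)]  # keep 48 and 89 only
print(inverse_haar(kept))                                # [137.0, -41.0, 48.0, 48.0]
```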

In order to overcome these problems, various techniques [6–8,19] which minimize the maximum absolute error (i.e., the $L_\infty$ norm error) or the maximum relative error have been proposed. Unfortunately, these techniques consume much time compared to the wavelet techniques with the $L_2$ norm error.

Also, for minimizing the general norm error $L_p$ (including the maximum error), Guha and Harb [12] proposed an unrestricted model based on dynamic programming. In the previous model (i.e., the restricted model), wavelet synopses are limited to wavelet coefficients. However, in [12], wavelet synopses are chosen as arbitrary real numbers in order to reduce the general norm error $L_p$. Later, Karras and Mamoulis [20] proposed the Haar+Tree, which is a refined, wavelet-inspired data structure. The Haar+Tree is composed of a single root coefficient and triads, each of which has a head coefficient and two supplementary coefficients. Using the Haar+Tree, Karras and Mamoulis [20] improved both the accuracy and the running time compared to the work of Guha and Harb [12]. However, in both [12,20], it is assumed that the maximum value for all data is known.

Recently, Saint-Paul and Mouaddlib [26] and Yager and Filev [29] presented summarization techniques. Although the method in [26] is useful for database browsing, it provides only a linguistic summarization in which the compressed data is represented by discrete symbols. Yager and Filev [29] deal with the problem of summarizing data, but focus on one-dimensional data.

To the best of our knowledge, the first work applying SVD to compress large data sets is SVDD [21]. To improve the accuracy of SVD, the largest difference values between the actual values and the reconstructed values are kept in SVDD. In addition, in the work of Poosala and Ioannidis [25], SVD was applied to estimate the join selectivity.

3. Preliminaries

In this section, we introduce the basics of Haar wavelet and singular value decomposition (SVD).

3.1. Haar wavelet

Wavelets are an important data analysis tool and have been used in image analysis and signal processing for a long time. Much work is based on the Haar wavelet among various wavelets due to its simplicity and efficiency. Given $N$ data items, the Haar wavelet computes the averages and differences of the values pairwise. Then, the Haar wavelet obtains the averages and the differences of the computed averages. This process is repeated until only one average and $N - 1$ differences are produced. The transformed values are called wavelet coefficients.

Note that the wavelet coefficients have different weights in reconstructing the values. Thus, the wavelet coefficients at level $l$ are divided by $\sqrt{2^l}$. This is called the normalization of wavelet coefficients. For example, the data $S = \{17, 11, 20, 16\}$ is transformed into $\{16, -2, 3/\sqrt{2}, 2/\sqrt{2}\}$. The full transformation procedure is presented in Fig. 2.

The number of transformed data items is equal to the number of original data items. To compact the information, wavelet thresholding is applied. For the wavelet thresholding to minimize the root squared error (the $L_2$ norm error), Lemma 1 is useful. From Lemma 1, if we consider $C'$ as the set whose elements are selected from $C$ by the wavelet thresholding (i.e., $c'_i$ is $c_i$ or 0) and $D'$ as the approximate data for $D$, it is easily understood that the selection of the largest coefficients minimizes the root squared error.

Lemma 1. Let $D = \{d_1, d_2, \ldots, d_n\}$ be original data and $D' = \{d'_1, d'_2, \ldots, d'_n\}$ be its approximation. Let $C = \{c_1, c_2, \ldots, c_n\}$ and $C' = \{c'_1, c'_2, \ldots, c'_n\}$ be the normalized one-dimensional wavelet coefficients for $D$ and $D'$, respectively. Then,

$$\sum_{i=1}^{n} (d_i - d'_i)^2 = n \sum_{i=1}^{n} (c_i - c'_i)^2$$


Fig. 3. An SVD example (graph).


Proof. The above formula is easily derived from Section 2.3 (Application I: Compression) in [27]. □
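As a concrete check of both the normalization and Lemma 1, consider the following sketch (Python; it assumes, as described above, that the details produced at level $l$ are divided by $\sqrt{2^l}$, counting the coarsest detail as level 0):

```python
import math

def haar_wt(data):
    """Normalized Haar transform: average first, then details, coarsest level first."""
    passes = []
    while len(data) > 1:
        passes.append([(a - b) / 2 for a, b in zip(data[::2], data[1::2])])
        data = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
    out = list(data)                                  # the single overall average
    for level, diffs in enumerate(reversed(passes)):  # level 0 = coarsest details
        out += [d / math.sqrt(2 ** level) for d in diffs]
    return out

def inverse_haar_wt(coeffs):
    data, pos, level = [coeffs[0]], 1, 0
    while pos < len(coeffs):
        ds = [c * math.sqrt(2 ** level) for c in coeffs[pos:pos + len(data)]]
        data = [v for a, d in zip(data, ds) for v in (a + d, a - d)]
        pos, level = pos + len(ds), level + 1
    return data

D = [17, 11, 20, 16]
C = haar_wt(D)                                # [16, -2, 3/sqrt(2), 2/sqrt(2)]
C2 = [c if abs(c) >= 2 else 0 for c in C]     # thresholding: drop 2/sqrt(2)
D2 = inverse_haar_wt(C2)
lhs = sum((d - d2) ** 2 for d, d2 in zip(D, D2))
rhs = len(D) * sum((c - c2) ** 2 for c, c2 in zip(C, C2))
assert abs(lhs - rhs) < 1e-9                  # Lemma 1 holds (both sides equal 8 here)
```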

3.2. The singular value decomposition

The matrix decomposition is one of the most important areas of linear algebra. Especially, the singular value decomposition (SVD) has been used in many scientific areas such as image processing, signal processing, and principal component analysis. SVD decomposes the matrix into a column-orthogonal matrix $(U)$, an orthogonal matrix $(V)$, and a diagonal matrix $(\Sigma)$ (i.e., $A = U\Sigma V^T$).

Before introducing the formal description of SVD, we describe the intuitive meaning. In Fig. 3, data items are scattered in the two-dimensional space. Since most data is clustered on axis $x'$, axis $x'$ is a principal component to represent the data items in Fig. 3. And, the next important axis is $y'$. So, we can change the coordinate system with axes $x$ and $y$ to another coordinate system with axes $x'$ and $y'$ in order to represent the characteristics of the data effectively.

Since most data items are located closely to axis $x'$, we can remove the information related to axis $y'$ with a little loss of accuracy. Thus, an accurate approximation can be achieved by keeping a piece of information related to the dominant axes. SVD enables us to find the principal components for given data.

The data plotted in Fig. 3 can be represented as a $7 \times 2$ matrix with $x$ and $y$ coordinates. If we transform data of the $x$ and $y$ coordinates into data of the $x'$ and $y'$ coordinates, we can represent the transformation with three matrices, $U$, $\Sigma$, and $V$. The new axes $x'$ and $y'$ are in $V$ and the importance of each axis is represented in $\Sigma$. $U\Sigma$ keeps the coordinate value of each data item in the $x'$ and $y'$ coordinate system.

The diagonal entries of $\Sigma$ are sorted in descending order. Thus, less important axes and related information can be dropped in order to compact the information. We describe SVD formally using the following notational convention:

• Matrices: $A, B, C, \ldots$ (uppercase letters)
• $i$th column of the matrices $A, B, C, \ldots$: $a_i, b_i, c_i, \ldots$
• Vectors: $x, y, z, \ldots$ (lowercase letters)
• $i$th row, $j$th column entry of matrices $A, B, C, \ldots$: $a_{ij}, b_{ij}, c_{ij}, \ldots$

Since SVD decomposes the matrix into the singular values, we first describe the formal definition of the singular values and then the theorem about SVD.

Definition 1. The square roots of the eigenvalues of $A^TA$ are called the singular values of $A$.

Theorem 1. Let $A$ be an $m \times n$ matrix with rank⁵ $r\ (\leq \min(m, n))$. Then there exist an $m \times n$ matrix $U$, an $n \times n$ matrix $V$, and an $n \times n$ diagonal matrix $\Sigma$ such that $A = U\Sigma V^T$, where

(1) $\Sigma_{ij} = \sigma_i$ if $i = j$ and $0$ otherwise, where $\sigma_i\ (1 \leq i \leq r)$ are the singular values of $A$, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$, and in case of $r < n$, $\sigma_{r+1} = \cdots = \sigma_n = 0$.
(2) $U = [u_1\ u_2\ \cdots\ u_n]$ is a column-orthogonal matrix.
(3) $V = [v_1\ v_2\ \cdots\ v_n]$ is an orthogonal matrix.

Proof. See [22]. □

⁵ The rank of a matrix $A$ is the maximal number of linearly independent columns of $A$.

Since, in case of $r < n$, $\sigma_{r+1} = \cdots = \sigma_n = 0$, some column vectors $(u_{r+1}, \ldots, u_n)$ of $U$ and some column vectors $(v_{r+1}, \ldots, v_n)$ of $V$ are useless. Thus, the matrices $U$, $\Sigma$, and $V$ are compacted into $m \times r$, $r \times r$, and $n \times r$ matrices, respectively.

Also, with $u_i$ (the $i$th column of $U$), $v_i$ (the $i$th column of $V$), and $\sigma_i$ (the $i$th diagonal entry of $\Sigma$), the matrix $A$ can be reconstructed using the formula $\sum_{i=1}^{r} \sigma_i u_i v_i^T$. We show an example of SVD.

Example 1. The $7 \times 2$ matrix $A$ represents the points in Fig. 3 and is decomposed by SVD as shown in Fig. 4 (left matrix: $U$; middle matrix: $\Sigma$; right matrix: $V$).

$v_1 = (-0.71, -0.71)$ and $v_2 = (-0.71, 0.71)$ are the coordinate values for axes $x'$ and $y'$. The $-v_1$ (and $v_2$) vector has the same direction as $x'$ (and $y'$) and its length is 1. Also, since the data is clustered around the axis $x'$, $\Sigma_{11} = 8.13$ is much bigger than $\Sigma_{22} = 0.35$.

As mentioned earlier, SVD finds the principal components for a given set of multi-dimensional points. A singular value in $\Sigma$ represents the importance of each axis (i.e., $v_i$). Thus, if $\sigma_i$ is much larger than the other singular values $\sigma_j$, where $i < j \leq n$, the points are located closely to $v_i$ (i.e., the data is highly correlated; since SVD considers spatial correlation but not temporal correlation, the correlation in this paper means the spatial correlation).

4. AMID

In this section, we present the mechanism of AMID which compresses multi-measured data effectively.

4.1. The application of SVD in AMID

We describe the basic compression technique in AMID using SVD. Multi-measured data can be represented as an $m \times n$ matrix, where $n$ is the number of measures and $m$ is the number of multi-measured data items. $m$ is much greater than $n$ in multi-measured environments.

There are many algorithms for SVD such as the Golub–Reinsch SVD algorithm and the R-SVD algorithm [11]. However, we apply the algorithm in Fig. 5 [22] instead of the Golub–Reinsch SVD algorithm and the R-SVD algorithm, since $A^TA$ is a small-sized $n \times n$ matrix in multi-measured environments. The algorithm generates a compacted form of SVD.

The total running time of the algorithm is $O(mn^2 + n\log n + n^3 + mnr)$. In Line 1, the running time of computing $A^TA$ is $O(mn^2)$ and the time complexity of computing the eigenvalues (i.e., $\Sigma^2$) and eigenvectors (i.e., $V$) of $A^TA$ is $O(n^3)$ [30]. In Line 2, the time complexity of the sorting is $O(n\log n)$. In Line 4, the running time of computing a value of $(Av_i)/\sigma_i$ is $O(nm)$. Thus, since there are $r$ terms for $(Av_i)/\sigma_i$, $O(nmr)$ is required at Line 4. In Line 5, some column vectors and singular values are dropped with respect to the compression factor, which denotes the number of singular values to be retained. For the remaining singular values, the corresponding columns of $U$ and $V$ also remain. The time complexity of removing columns and singular values is $O(1)$. Due to $m \gg n$, the time complexity of the algorithm can be considered as $O(m)$.
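A minimal sketch of this procedure (NumPy; our reading of the Fig. 5 steps described above, not the paper's exact pseudocode):

```python
import numpy as np

def compact_svd(A, k):
    """Compacted SVD of an m x n matrix A (m >> n), keeping the k largest singular values."""
    # Line 1: eigenvalues (Sigma^2) and eigenvectors (V) of the small n x n matrix A^T A
    eigvals, V = np.linalg.eigh(A.T @ A)
    # Line 2: sort singular values in descending order
    order = np.argsort(eigvals)[::-1]
    sigma = np.sqrt(np.clip(eigvals[order], 0, None))
    V = V[:, order]
    # Line 4: u_i = (A v_i) / sigma_i for the r nonzero singular values
    r = int(np.sum(sigma > 1e-12))
    U = (A @ V[:, :r]) / sigma[:r]
    # Line 5: SVD thresholding with compression factor k
    return U[:, :k], sigma[:k], V[:, :k]

A = np.random.rand(10000, 4)        # m >> n, as in multi-measured data
U, s, V = compact_svd(A, k=2)
A_hat = U @ np.diag(s) @ V.T        # rank-2 approximation of A
```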

4.2. Error bound of SVD thresholding

In order to reduce the size of the matrices generated by SVD, the small singular values $(\sigma_{k+1}, \ldots, \sigma_r)$ and the related column vectors $(u_{k+1}, \ldots, u_r, v_{k+1}, \ldots, v_r)$ of $U$ and $V$ are dropped at Line 5 in Fig. 5. We call this dropping SVD thresholding.

In this section, we show that SVD thresholding minimizes the root squared error of the reconstructed matrix. In addition, we suggest the error bound of an individual value with SVD thresholding, which is not suggested in wavelet thresholding minimizing the root squared error.

Since the errors of vectors and matrices can be represented by norms, we first present the definitions of norms for vectors and matrices.

Fig. 4. An SVD example (matrix).


Fig. 5. Algorithm for SVD compression.


Definition 2. The $L_p$ norm for a column vector $x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$ is defined as

$$\|x\|_p = (|x_1|^p + |x_2|^p + \cdots + |x_n|^p)^{1/p}, \quad \text{where } p \geq 1 \text{ and } |x_i| \text{ is the absolute value of } x_i$$

The $L_p$ norm for a matrix is defined differently, using the $L_p$ norm for a vector.

Definition 3. The $L_p$ norm for $A \in \mathbb{R}^{m \times n}$ is defined as⁶

$$\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p}, \quad \text{where } x \in \mathbb{R}^n$$

⁶ For a set $S$, $\sup_{x \in S} x$ is the least upper bound of the set $S$, defined as a quantity $M$ such that no member of the set exceeds $M$.

Another useful norm for a matrix is the Frobenius norm. The Frobenius norm is defined as follows:

Definition 4. The Frobenius norm for $A \in \mathbb{R}^{m \times n}$ is defined as

$$\|A\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} |a_{ij}|^2}$$

The advantage of SVD is not only to exactly evaluate the Frobenius norm for the approximated data but also to suggest the error bound of an individual data value using the Eckart–Young theorem [4].

Theorem 2 (Eckart–Young). Let $A$ be an $m \times n$ matrix. By SVD, the matrix $A$ can be expressed as $\sum_{i=1}^{r} \sigma_i u_i v_i^T$. We define $\hat{A} = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ $(k \leq r)$ with the first $k$ terms. Then,

(a) $\displaystyle\min_{\mathrm{rank}(B)=k} \|A - B\|_F = \|A - \hat{A}\|_F = \sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$

(b) $\displaystyle\min_{\mathrm{rank}(B)=k} \|A - B\|_2 = \|A - \hat{A}\|_2 = \sigma_{k+1}$

Theorem 2(a) states that retaining the largest singular values and the related column vectors minimizes the root squared error, and that this error can be evaluated easily by $\sqrt{\sum_{i=k+1}^{r} \sigma_i^2}$. We show Theorem 3 for guaranteeing the error bound of each entry in a matrix (i.e., the maximum absolute error). We first provide the well-known Lemma 2 for proving Theorem 3.

Lemma 2. For any matrix $A \in \mathbb{R}^{m \times n}$,

$$\|A\|_p = \max_{\|x\|_p = 1} \|Ax\|_p, \quad \text{where } x \in \mathbb{R}^n$$

Proof.

$$\|A\|_p = \sup_{x \neq 0} \frac{\|Ax\|_p}{\|x\|_p} = \sup_{x \neq 0} \left\| A \frac{x}{\|x\|_p} \right\|_p = \max_{\|x\|_p = 1} \|Ax\|_p \qquad \square$$

Theorem 3. Let $A$ be an $m \times n$ matrix. By SVD, the matrix $A$ can be expressed as $\sum_{i=1}^{r} \sigma_i u_i v_i^T$. We define $\hat{A} = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ $(k \leq r)$ with the first $k$ terms and $E = A - \hat{A}$. Then,

$$|e_{ij}| \leq \sigma_{k+1} \quad \text{for all } i, j$$

Proof. By Theorem 2(b), $\|E\|_2 = \sigma_{k+1}$. Also, by Lemma 2,

$$\|E\|_2 = \sigma_{k+1} = \max_{\|x\|_2 = 1} \|Ex\|_2 \geq \|Ey\|_2 \quad \text{for all } y \text{ with } \|y\|_2 = 1$$

Therefore, $\sigma_{k+1} \geq \|Ey\|_2$ for $\|y\|_2 = 1$. Let $y_l$ be a column vector whose $l$th entry is 1 and whose other entries are 0. Then, for $1 \leq l \leq n$,

$$\sigma_{k+1} \geq \|Ey_l\|_2 = \sqrt{e_{1l}^2 + \cdots + e_{ml}^2}$$

Since $e_{1l}^2, \ldots, e_{ml}^2 \geq 0$, we have $|e_{1l}|, \ldots, |e_{ml}| \leq \sigma_{k+1}$. Therefore, $|e_{ij}| \leq \sigma_{k+1}$ for all $i, j$. □

Generally, it is difficult to guarantee the error bound of an individual data value in data compression. However, by Theorem 3, we can guarantee the error bound of an individual value in a matrix.
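Both bounds are easy to verify numerically; a small check (NumPy, on random data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(500, 6))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation
E = A - A_hat

# Theorem 2(a): ||E||_F = sqrt(sum of the dropped squared singular values)
assert np.isclose(np.linalg.norm(E, "fro"), np.sqrt(np.sum(s[k:] ** 2)))

# Theorem 3: every entry of E is bounded by sigma_{k+1} (s[k] with 0-based indexing)
assert np.all(np.abs(E) <= s[k] + 1e-12)
```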

4.3. Combining SVD and wavelets

Although data can be compressed effectively using SVD, there may be important information among the lost information. To improve the accuracy of the compression, we combine SVD and wavelets.

SVD thresholding removes two columns (one for $U$ and one for $V$) and a singular value as a unit. Thus, SVD thresholding does not fully utilize the limited storage space in some cases, leaving idle space smaller than the unit. Therefore, we can fully utilize the storage space by applying wavelets, since an arbitrary number of wavelet coefficients can be selected using wavelet thresholding.

Let an $m \times n$ matrix $A$ with rank $r$ be $\sum_{i=1}^{r} \sigma_i u_i v_i^T$ and $\hat{A}$ with the compression factor $k$ be $\sum_{i=1}^{k} \sigma_i u_i v_i^T$. A naive application of wavelets is to apply wavelets to the error matrix $(= A - \hat{A})$.

As mentioned earlier, wavelet thresholding which chooses the largest coefficients has the problem of the poor quality of the reconstructed data for biased original data. However, by Theorem 3, the error of an entry of the matrix is bounded. Thus, in general, the values in the error matrix are not much biased. However, in order to apply the wavelet transform, we must compute the error matrix by subtracting the reconstructed matrix from the original matrix.

Also, as mentioned in Section 2, SVDD [21] keeps some differences between actual values and reconstructed values in order to improve the accuracy and to utilize the space fully. Thus, the construction of the error matrix is required in SVDD.

The idea of combining SVD and wavelets in AMID is to apply wavelets to each column vector of $U$ that will be removed. Therefore, the overhead of constructing the error matrix is avoided. Recall that $m$ is much greater than $n$ and $r$. Thus, only a small storage space is needed to store the $n \times r$ matrix $V$ and the $r$ singular values. However, storing an $m \times 1$ column vector of $U$ consumes much storage space. Thus, AMID applies wavelets to the column vectors of $U$.

Each $u_i$ that will be dropped has a different weight in the reconstruction of the data. Therefore, we give a weight $w(i)$ to $u_i$. Suppose that one term $\sigma_i u_i v_i^T$ is removed:

$$\sigma_i u_i v_i^T = u_i(\sigma_i v_i^T) = u_i[\sigma_i v_{1i}\ \ \sigma_i v_{2i}\ \cdots\ \sigma_i v_{ni}] = [\sigma_i v_{1i} u_i\ \ \sigma_i v_{2i} u_i\ \cdots\ \sigma_i v_{ni} u_i]$$

A wavelet transform is a linear transformation, since a wavelet transform is the process of averaging and subtracting values. That is, $WT(cu) = c\,WT(u)$, where $WT$ is a wavelet transform with the normalization, $c \in \mathbb{R}$, and $u \in \mathbb{R}^n$. Therefore,

$$WT(\sigma_i u_i v_i^T) = WT([\sigma_i v_{1i} u_i\ \ \sigma_i v_{2i} u_i\ \cdots\ \sigma_i v_{ni} u_i]) = [\sigma_i v_{1i} WT(u_i)\ \ \sigma_i v_{2i} WT(u_i)\ \cdots\ \sigma_i v_{ni} WT(u_i)]$$

As shown in the above equation, we do not apply wavelets to $\sigma_i u_i v_i^T$; we only apply wavelets to $u_i$, since each column vector of $\sigma_i u_i v_i^T$ is the multiplication of $u_i$ and a scalar value $\sigma_i v_{ji}$, where $1 \leq j \leq n$. Since $\sigma_i v_{1i}, \ldots,$ and $\sigma_i v_{ni}$ are multiplied to the column vector $u_i$, we define the weight $w(i)$ of $u_i$ as $\sigma_i \cdot \sum_{j=1}^{n} |v_{ji}|$.

the largest coefficients are selected. For constructing an error matrix with selected coefficients, AMID must keep column vec-tors of V and singular values corresponding to the selected coefficients.


In order to fully utilize the storage space, the column vectors of $V$ and the singular values that do not correspond to the selected wavelet coefficients are dropped. For brevity, we omit the detailed procedure of this.

The algorithm for combining SVD and wavelets is summarized in Fig. 6. In Line 1, the SVD of $A$ is evaluated. In Line 2, the wavelet coefficients of the dropped columns of $U$ are computed. Then, after computing the weight of each dropped column (Line 3), the wavelet coefficients are multiplied by the corresponding weights in Line 4. In Lines 5–6, we choose the largest coefficients and store them.
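A compact sketch of Lines 1–6 (NumPy; illustrative only: it assumes $m$ is a power of two for the Haar transform, reuses the normalized transform of Section 3.1, and returns the selected coefficients rather than modeling the storage layout):

```python
import numpy as np

def haar_wt(u):
    """Normalized Haar transform; linear, so WT(c*u) = c*WT(u)."""
    data, passes = list(u), []
    while len(data) > 1:
        passes.append([(a - b) / 2 for a, b in zip(data[::2], data[1::2])])
        data = [(a + b) / 2 for a, b in zip(data[::2], data[1::2])]
    out = list(data)
    for level, diffs in enumerate(reversed(passes)):
        out += [d / np.sqrt(2.0 ** level) for d in diffs]
    return np.array(out)

def amid(A, k, n_coeffs):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # Line 1
    r = int(np.sum(s > 1e-12))
    candidates = []
    for i in range(k, r):                              # the dropped columns u_i
        wt = haar_wt(U[:, i])                          # Line 2
        w = s[i] * np.sum(np.abs(Vt[i, :]))            # Line 3: w(i) = sigma_i * sum_j |v_ji|
        for pos, c in enumerate(wt):                   # Line 4: weight the coefficients
            candidates.append((abs(w * c), i, pos, c))
    candidates.sort(reverse=True)                      # Lines 5-6: keep the largest
    selected = [(i, pos, c) for _, i, pos, c in candidates[:n_coeffs]]
    return (U[:, :k], s[:k], Vt[:k, :]), selected
```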

Even though combining SVD and wavelets improves the error, the error bounds for the root squared error and the maximum absolute error in Section 4.2 cannot be applied to it. We provide a formula of the error bound for the root squared error when AMID, which combines SVD and wavelets, is applied.

Theorem 4. Let $A$ be an $m \times n$ matrix. By SVD, the matrix $A$ can be expressed as $\sum_{i=1}^{r} \sigma_i u_i v_i^T$. We define $\hat{A} = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ $(k \leq r)$ with the first $k$ terms and $\hat{A}' = \hat{A} + \sum_{i=k+1}^{r} \sigma_i \bar{u}_i v_i^T$, where $\bar{u}_i$ is the approximation of $u_i$ reconstructed from the wavelet coefficients for combining SVD and wavelets. In addition, $e_i$ is the sum of the squares of the dropped wavelet coefficients for $u_i$. Then, the error matrix $E' = A - \hat{A}'$ satisfies

$$\|E'\|_F \leq \sqrt{m(e_{k+1}\sigma_{k+1}^2 + \cdots + e_r\sigma_r^2)}$$

Proof. First, consider the root squared error $(= \|E'_i\|_F)$ for the $i$th column $u_i$ to which the wavelet approximation is applied:

$$E'_i = u_i\sigma_iv_i^T - \bar{u}_i\sigma_iv_i^T = (u_i - \bar{u}_i)\sigma_iv_i^T$$

Let $s_i = u_i - \bar{u}_i$. Then,

$$E'_i = s_i\sigma_iv_i^T = \sigma_i s_i v_i^T = \sigma_i \begin{pmatrix} s_{1i} \\ s_{2i} \\ \vdots \\ s_{mi} \end{pmatrix} \begin{pmatrix} v_{1i} & v_{2i} & \cdots & v_{ni} \end{pmatrix} = \sigma_i \begin{pmatrix} s_{1i}v_{1i} & s_{1i}v_{2i} & \cdots & s_{1i}v_{ni} \\ s_{2i}v_{1i} & s_{2i}v_{2i} & \cdots & s_{2i}v_{ni} \\ \vdots & & & \vdots \\ s_{mi}v_{1i} & s_{mi}v_{2i} & \cdots & s_{mi}v_{ni} \end{pmatrix}$$

Then,

$$\begin{aligned}
\|E'_i\|_F^2 &= \sigma_i^2\big((s_{1i}^2 + \cdots + s_{mi}^2)v_{1i}^2 + (s_{1i}^2 + \cdots + s_{mi}^2)v_{2i}^2 + \cdots + (s_{1i}^2 + \cdots + s_{mi}^2)v_{ni}^2\big) \\
&= \sigma_i^2(m e_i v_{1i}^2 + m e_i v_{2i}^2 + \cdots + m e_i v_{ni}^2) \quad (\text{by Lemma 1}) \\
&= m e_i \sigma_i^2(v_{1i}^2 + v_{2i}^2 + \cdots + v_{ni}^2) \\
&= m e_i \sigma_i^2 \quad (\text{since } V \text{ is an orthogonal matrix})
\end{aligned}$$

We can apply the above formula to all dropped columns $u_{k+1}, u_{k+2}, \ldots, u_r$:

$$\begin{aligned}
\|E'\|_F^2 &= \|E'_{k+1} + E'_{k+2} + \cdots + E'_r\|_F^2 \\
&\leq \|E'_{k+1}\|_F^2 + \|E'_{k+2}\|_F^2 + \cdots + \|E'_r\|_F^2 \\
&= m(e_{k+1}\sigma_{k+1}^2 + e_{k+2}\sigma_{k+2}^2 + \cdots + e_r\sigma_r^2) \qquad \square
\end{aligned}$$

Using Theorem 4, we can provide the error bound for the Frobenius norm when AMID is applied. Also, we prove by Corollary 1 that AMID (i.e., combining SVD and wavelets) is better than using only SVD.

Corollary 1. Let $A$, $\hat{A}$, and $\hat{A}'$ be the same as defined in Theorem 4. Suppose that $E' = A - \hat{A}'$ and $E = A - \hat{A}$. Then,

$$\|E'\|_F^2 \leq \|E\|_F^2$$

Fig. 6. Algorithm for combining SVD and wavelets.


Proof. Since $U$ is a column-orthogonal matrix,

$$U^TU = \begin{pmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{pmatrix} \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix} = \begin{pmatrix} u_1^Tu_1 & u_1^Tu_2 & \cdots & u_1^Tu_n \\ \vdots & \vdots & & \vdots \\ u_n^Tu_1 & u_n^Tu_2 & \cdots & u_n^Tu_n \end{pmatrix} = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ \vdots & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$$

Therefore, $1 = u_i^Tu_i = \|u_i\|_F^2$. By Lemma 1, $\|u_i\|_F^2 = m\|WT(u_i)\|_F^2$. Also, by the definition of $e_i$, $\|WT(u_i)\|_F^2 \geq e_i$ is trivially derived. From the above equations, $1 = \|u_i\|_F^2 = m\|WT(u_i)\|_F^2 \geq m e_i$. Therefore,

$$1 \geq m e_i \qquad (1)$$

By Theorem 4,

$$\begin{aligned}
\|E'\|_F^2 &\leq m(e_{k+1}\sigma_{k+1}^2 + \cdots + e_r\sigma_r^2) \\
&= m e_{k+1}\sigma_{k+1}^2 + \cdots + m e_r\sigma_r^2 \\
&\leq \sigma_{k+1}^2 + \cdots + \sigma_r^2 \quad (\text{by Eq. (1)}) \\
&= \|E\|_F^2 \quad (\text{by Theorem 2(a)})
\end{aligned}$$

Therefore, $\|E'\|_F^2 \leq \|E\|_F^2$. □

The following example shows the behavior of our proposed technique, AMID.

Example 2. Assume that the SVD of the $4 \times 3$ matrix $A$ is represented as below and we can store one column of $U$ and two wavelet coefficients in the storage space. Also, in this example, we do not consider the space to store $V$ and $\Sigma$ for convenience, since $U$ is much larger than $V$ and $\Sigma$.

$$A = U\Sigma V^T = \begin{pmatrix} \frac{1}{2} & \frac{9}{\sqrt{164}} & -\frac{1}{\sqrt{164}} \\ \frac{1}{2} & \frac{1}{\sqrt{164}} & \frac{9}{\sqrt{164}} \\ -\frac{1}{2} & \frac{9}{\sqrt{164}} & -\frac{1}{\sqrt{164}} \\ -\frac{1}{2} & \frac{1}{\sqrt{164}} & \frac{9}{\sqrt{164}} \end{pmatrix} \begin{pmatrix} 5 & 0 & 0 \\ 0 & 1.2 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0.6 & 0.8 \\ 0 & 0.8 & -0.6 \end{pmatrix}^T$$

The dropped columns are $u_2$ and $u_3$. The (normalized) wavelet coefficients for them are as follows:

$$WT(u_2) = WT\left(\left[\tfrac{9}{\sqrt{164}}\ \ \tfrac{1}{\sqrt{164}}\ \ \tfrac{9}{\sqrt{164}}\ \ \tfrac{1}{\sqrt{164}}\right]^T\right) = \left[\tfrac{5}{\sqrt{164}}\ \ 0\ \ \tfrac{4}{\sqrt{328}}\ \ \tfrac{4}{\sqrt{328}}\right]^T$$

$$WT(u_3) = WT\left(\left[-\tfrac{1}{\sqrt{164}}\ \ \tfrac{9}{\sqrt{164}}\ \ -\tfrac{1}{\sqrt{164}}\ \ \tfrac{9}{\sqrt{164}}\right]^T\right) = \left[\tfrac{4}{\sqrt{164}}\ \ 0\ \ -\tfrac{5}{\sqrt{328}}\ \ -\tfrac{5}{\sqrt{328}}\right]^T$$

From $w(i) = \sigma_i \cdot \sum_{j=1}^{n} |v_{ji}|$, the weights for $u_2$ and $u_3$ are computed as follows:

$$w(2) = 1.2 \times (|0| + |0.6| + |0.8|) = 1.68$$
$$w(3) = 1 \times (|0| + |0.8| + |{-0.6}|) = 1.4$$

We compute $w(2) \cdot WT(u_2)$ and $w(3) \cdot WT(u_3)$ and choose the largest two coefficients. Among $\{1.68 \times \tfrac{5}{\sqrt{164}},\ 0,\ 1.68 \times \tfrac{4}{\sqrt{328}},\ 1.68 \times \tfrac{4}{\sqrt{328}},\ 1.4 \times \tfrac{4}{\sqrt{164}},\ 0,\ -1.4 \times \tfrac{5}{\sqrt{328}},\ -1.4 \times \tfrac{5}{\sqrt{328}}\}$, the values $1.68 \times \tfrac{5}{\sqrt{164}}$ and $1.4 \times \tfrac{4}{\sqrt{164}}$ are selected. Therefore, we choose $\tfrac{5}{\sqrt{164}}$ (the first coefficient of $WT(u_2)$) and $\tfrac{4}{\sqrt{164}}$ (the first coefficient of $WT(u_3)$) as the stored wavelet coefficients.

5. Enhancement of AMID for incremental updates

As illustrated in Fig. 1, the DSS generates snapshots as the multi-measured data arrives continuously. In order to support historical analysis, the current snapshot should be combined with the archived data.

A naive approach of AMID in incremental update environments is to reconstruct the archived data, combine the current snapshot with the archived data, and then apply AMID to the consolidated data. It takes a long time to reconstruct the archived data and compute the SVD of the consolidated matrix. Thus, we propose two versions of AMID: Incremental AMID and Local AMID. To the best of our knowledge, this is the first work to apply SVD to incremental update environments.


5.1. Incremental AMID

In this section, we propose an incremental compression technique using the information of the precomputed SVD, called Incremental AMID. Incremental AMID combines the current snapshot with the compressed data without the reconstruction of the archived data. For Incremental AMID, we derive the following formula.

Suppose that an $m_1 \times n$ matrix $A_1 (= U_1\Sigma_1V_1^T)$ is the archived data and an $m_2 \times n$ matrix $A_2$ is the current snapshot, where $m_1 \gg m_2 \gg n$. The consolidated matrix is

$$A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} = \begin{pmatrix} U_1\Sigma_1V_1^T \\ A_2 \end{pmatrix}$$

Then, $A$ can be transformed as follows:

$$A = \begin{pmatrix} U_1\Sigma_1V_1^T \\ A_2 \end{pmatrix} = \begin{pmatrix} U_1\Sigma_1 \\ A_2V_1 \end{pmatrix} V_1^T \quad (\text{since } V_1V_1^T = I) \\ = \begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} \begin{pmatrix} \Sigma_1 \\ A_2V_1 \end{pmatrix} V_1^T$$

Then, SVD is applied to the middle matrix: $\begin{pmatrix} \Sigma_1 \\ A_2V_1 \end{pmatrix} = U'\Sigma'V'^T$. Since the middle matrix, whose size is $(n + m_2) \times n$, is much smaller than the matrix $A$, whose size is $(m_1 + m_2) \times n$, the computation time of SVD for the middle matrix is much smaller than that for the matrix $A$. Consequently, $A$ is decomposed as follows:

$$A = \begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} (U'\Sigma'V'^T)V_1^T = \left(\begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} U'\right)\Sigma'(V_1V')^T = U\Sigma V^T,$$

where $U = \begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} U'$, $\Sigma = \Sigma'$, and $V = V_1V'$.

The following formulas prove that $U$ is a column-orthogonal matrix and $V$ is an orthogonal matrix:

$$\left(\begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} U'\right)^T \begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} U' = U'^T \begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix}^T \begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} U' = U'^T I U' = I$$

Also,

$$(V_1V')^T(V_1V') = V'^TV_1^TV_1V' = V'^TIV' = I$$
$$(V_1V')(V_1V')^T = V_1V'V'^TV_1^T = V_1IV_1^T = I$$

Therefore, we can reduce the amount of computation for the SVD of the consolidated matrix $A$ by computing the SVD of $\begin{pmatrix} \Sigma_1 \\ A_2V_1 \end{pmatrix}$ instead of the SVD of $A$ itself. That is, we can compute the SVD of $A$ incrementally without the reconstruction of $A_1$.

As presented in Section 4.3, $A_1$ is approximated by the combination of SVD and the wavelet technique in AMID. Some column vectors of $U_1$ dropped by SVD thresholding are compressed by the wavelet technique. For the incremental update, the dropped column vectors of $U_1$ are reconstructed using the stored wavelet coefficients. Then, the above procedure is applied in Incremental AMID. SVD thresholding and wavelet thresholding are also applied to the dropped columns of $U$.

In our experiments, we show that Incremental AMID is much faster than the naive approach in incremental update environments.
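A minimal sketch of this update (NumPy; illustrative, with the wavelet reconstruction step omitted and the helper name ours, not the paper's). It computes the SVD of the stacked matrix from $(U_1, \Sigma_1, V_1)$ and $A_2$ without materializing $A_1$, and can be checked against Example 3 below:

```python
import numpy as np

def incremental_svd(U1, s1, V1, A2):
    """Update the SVD A1 = U1 diag(s1) V1^T with a new snapshot A2,
    without reconstructing A1 (Section 5.1)."""
    # The middle matrix [Sigma_1; A2 V1] is only (n + m2) x n: cheap to decompose.
    M = np.vstack([np.diag(s1), A2 @ V1])
    Up, sp, Vpt = np.linalg.svd(M, full_matrices=False)
    # U = blockdiag(U1, I) @ U'
    U = np.vstack([U1 @ Up[:len(s1), :], Up[len(s1):, :]])
    V = V1 @ Vpt.T
    return U, sp, V

# Check with the matrices of Example 3 below.
U1 = np.array([[.5, .5], [.5, .5], [.5, -.5], [.5, -.5]])
s1 = np.array([5.0, 2.0]); V1 = np.eye(2)
A2 = np.array([[0.6, -0.8], [8, 6], [0, 0], [0, 0]])
U, s, V = incremental_svd(U1, s1, V1, A2)
A1 = U1 @ np.diag(s1) @ V1.T
assert np.allclose(U @ np.diag(s) @ V.T, np.vstack([A1, A2]))
print(np.round(s, 4))   # [10.8812  3.4059], as in Example 3
```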

Example 3. To easily understand how data is stored in Incremental AMID, we explain it with an example. We assume the following:

• We let the $4 \times 2$ matrix $A_1$ be the archived data and the $4 \times 2$ matrix $A_2$ the current snapshot. $A_1$ is stored in the SVD form of $U_1$, $\Sigma_1$, $V_1$.
• We do not consider the space to store $V$ and $\Sigma$ since $U$ is much bigger than $V$ and $\Sigma$. In this example, we have the space for storing two $4 \times 1$ column vectors.
• We do not consider combining SVD and wavelets since it is straightforward and would make the example complicated.

The SVDs of $A_1$ and $A_2$ are represented as below:

$$A_1 = \begin{pmatrix} 2.5 & 1 \\ 2.5 & 1 \\ 2.5 & -1 \\ 2.5 & -1 \end{pmatrix} = U_1\Sigma_1V_1^T = \begin{pmatrix} 0.5 & 0.5 \\ 0.5 & 0.5 \\ 0.5 & -0.5 \\ 0.5 & -0.5 \end{pmatrix} \begin{pmatrix} 5 & 0 \\ 0 & 2 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}^T$$

$$A_2 = \begin{pmatrix} 0.6 & -0.8 \\ 8 & 6 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} = U_2\Sigma_2V_2^T = \begin{pmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 10 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0.8 & 0.6 \\ 0.6 & -0.8 \end{pmatrix}^T$$

To compute the SVD of the consolidated matrix $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$, we first compute the SVD $(= U'\Sigma'V'^T)$ of $\begin{pmatrix} \Sigma_1 \\ A_2V_1 \end{pmatrix}$:

$$\begin{pmatrix} \Sigma_1 \\ A_2V_1 \end{pmatrix} = \begin{pmatrix} 5 & 0 \\ 0 & 2 \\ 0.6 & -0.8 \\ 8 & 6 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0.3921 & -0.7655 \\ 0.0958 & 0.5011 \\ 0.0087 & -0.2923 \\ 0.9149 & 0.2784 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 10.8812 & 0 \\ 0 & 3.4059 \end{pmatrix} \begin{pmatrix} 0.8533 & -0.5215 \\ 0.5215 & 0.8533 \end{pmatrix}^T$$

With the above result, we compute $U$, $V$, and $\Sigma$:

$$U = \begin{pmatrix} U_1 & 0 \\ 0 & I \end{pmatrix} U' = \begin{pmatrix} 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & 0.5 & 0 & 0 & 0 & 0 \\ 0.5 & -0.5 & 0 & 0 & 0 & 0 \\ 0.5 & -0.5 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0.3921 & -0.7655 \\ 0.0958 & 0.5011 \\ 0.0087 & -0.2923 \\ 0.9149 & 0.2784 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0.24395 & -0.1322 \\ 0.24395 & -0.1322 \\ 0.14815 & -0.6333 \\ 0.14815 & -0.6333 \\ 0.0087 & -0.2923 \\ 0.9149 & 0.2784 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}$$

$$V = V_1V' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 0.8533 & -0.5215 \\ 0.5215 & 0.8533 \end{pmatrix} = \begin{pmatrix} 0.8533 & -0.5215 \\ 0.5215 & 0.8533 \end{pmatrix}, \qquad \Sigma = \Sigma' = \begin{pmatrix} 10.8812 & 0 \\ 0 & 3.4059 \end{pmatrix}$$

Therefore, we can compute the SVD of $A$ without reconstructing the archived data, which is generally very large (i.e., without the matrix multiplication $U_1\Sigma_1V_1^T$ for $A_1$). Since we have the space for two $4 \times 1$ columns, we store one $8 \times 1$ column. Therefore, we store the first column of $U$ (i.e., $[0.24395\ \ 0.24395\ \ 0.14815\ \ 0.14815\ \ 0.0087\ \ 0.9149\ \ 0\ \ 0]^T$). In addition, we store the first column of $V$ (i.e., $[0.8533\ \ 0.5215]^T$) and the first singular value 10.8812.

5.2. Local AMID

Although Incremental AMID is faster than the naive approach, Incremental AMID requires several matrix multiplications. Thus, we devise Local AMID, which improves the compression time. Basically, Local AMID is a technique that compresses the current snapshot such that the interference of the archived data is minimized.

The algorithm of Local AMID is presented in Fig. 7. Let the $j$th snapshot be an $m_j \times n$ matrix $A_j$ $(1 \leq j < i)$, let an $m_i \times n$ matrix $A_i$ be the current snapshot, and let the disk space be $B$. Suppose that $A_j$ is stored in the format of $U_j$, $V_j$, $\Sigma_j$ with some wavelet coefficients $W_j$, for $1 \leq j < i$.

In Line 1 of the algorithm in Fig. 7, $A_i$ is decomposed into $U_i$, $V_i$, and $\Sigma_i$. In order to store $U_i$, $V_i$, and $\Sigma_i$, some column vectors and singular values for $A_j$ should be removed, where $1 \leq j < i$. Also, $A_i$ needs to be approximated.

Fig. 7. Local compression algorithm.


Table 1. Table of symbols.

Approach | Explanation
AMID | Our approach for data compression
SVD | An SVD approach
SVDD | An SVD approach with deltas [21]
SVDW | An SVD approach with wavelets applied to the error matrix
EWA | Extended wavelets [14]
LocAMID | Local AMID
IncAMID | Incremental AMID
IncSVD | A naive approach for incremental updates. To compute the SVD for the archived data (stored in the form $U_1$, $\Sigma_1$, $V_1$) and the current snapshot $(A_2)$, IncSVD first reconstructs the archived data by computing $U_1\Sigma_1V_1^T (= A_1)$ and then computes the SVD for the whole matrix $\begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$


To determine the column vectors to be dropped, Local AMID evaluates the singular values over all $\Sigma_1, \ldots, \Sigma_i$ (Lines 2 and 3). Then, less important column vectors and singular values are dropped (Line 4). In Line 5, the new best wavelet coefficients are chosen.

By Theorem 2(a), the root squared error is in proportion to the dropped singular values. Since Local AMID drops the smallest singular values over the diagonal entries of $\Sigma_1, \ldots, \Sigma_i$, the root squared error of the retained data is minimized. Therefore, the algorithm in Fig. 7 minimizes the root squared error.
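A rough sketch of this pooling step (NumPy; illustrative, with the per-snapshot wavelet coefficients of Line 5 omitted for brevity):

```python
import numpy as np

def local_amid(snapshots, budget):
    """Across all snapshots, keep the column vectors belonging to the
    globally largest singular values, subject to a column budget."""
    svds = [np.linalg.svd(A, full_matrices=False) for A in snapshots]
    # Pool all singular values as (sigma, snapshot index j, position i).
    pool = [(s[i], j, i) for j, (_, s, _) in enumerate(svds) for i in range(len(s))]
    pool.sort(reverse=True)                       # largest singular values first
    keep = {(j, i) for _, j, i in pool[:budget]}  # Line 4: drop the smallest ones
    stored = []
    for j, (U, s, Vt) in enumerate(svds):
        idx = [i for i in range(len(s)) if (j, i) in keep]
        stored.append((U[:, idx], s[idx], Vt[idx, :]))
    return stored
```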

Although Incremental AMID is based on the mathematical analysis, Incremental AMID is a little less accurate than Local AMID for less correlated data, since the archived data is approximated and the error contained in the approximated data is propagated over the incremental compression in Incremental AMID, while Local AMID deals with many matrices independently.

However, in the case that the data is highly correlated, Incremental AMID is more accurate than Local AMID since Incremental AMID deals with the consolidated matrix. Local AMID cannot use the correlation over the snapshots since it deals with the snapshots independently.

In AMID, the wavelet compression is applied to the dropped $u_i$ which is related to the dropped singular value. As presented in Theorem 4, i.e., $\|E'\|_F \leq \sqrt{m(e_{k+1}\sigma_{k+1}^2 + \cdots + e_r\sigma_r^2)}$, the Frobenius norm of the error matrix is proportional to the sum of the multiplications of a dropped singular value and the sum of the squares of the dropped wavelet coefficients for $u_i$. As mentioned before, a singular value represents the importance of a dominant vector. Thus, for highly correlated data, the dropped singular values are much smaller than the remaining singular values. This means that the sum of the squares of the dropped wavelet coefficients for $u_i$ affects the consolidated matrix less. In contrast, for less correlated data, the dropped singular values are not much smaller than the remaining singular values. Thus, the error caused by the wavelet thresholding affects the consolidated matrix much more compared to the case of the highly correlated data.

Example 4. In this example, we explain how data is stored in Local AMID. The assumptions in this example are the same as those in Example 3. In Local AMID, we sort the singular values in order to drop columns (i.e., 10, 5, 2, 1). Since we can store two $4 \times 1$ columns, we select the two smallest values (2 and 1) and drop those values and the related column vectors. Therefore, in this example, $[0.5\ \ 0.5\ \ 0.5\ \ 0.5]^T$, $[1\ \ 0]^T$, and 5 for $A_1$, and $[0\ \ 1\ \ 0\ \ 0]^T$, $[0.8\ \ 0.6]^T$, and 10 for $A_2$ are stored.

6. Experiments

In this section, we demonstrate the effectiveness of AMID. We performed experiments on both real-life data sets and synthetic data sets to evaluate the accuracy and efficiency of AMID. Also, for comparing AMID with other approaches, we implemented diverse approaches. Table 1 summarizes the symbols and the names of the techniques used to explain the experimental results.

We first show the accuracy and performance of AMID, SVD, SVDD, SVDW, and EWA using static data. For extended wavelets, we use the algorithm presented in XWAVE [14]. In [14], the authors presented a simple intuition of wavelets for incremental updates but did not describe the detailed procedure of extending XWAVE. In addition, AMID is superior to extended wavelets in static data environments. Thus, we compared LocAMID, IncAMID, and IncSVD in incremental environments.

6.1. Experimental environment

Our experiments were performed on a 3 GHz Pentium 4 with 1024 MB of main memory, running Windows XP. We used stock data as the real-life data set.⁷ The stock data is a list of prices of stocks per day during 8 years. The stock data has four measures (the market price, the upper price, the lower price, and the closing price) and 56,174 rows. Also, we generated synthetic data with a Zipf distribution (Z = 0.5). The synthetic data consists of 30,000 rows with 20 measures.

four measures (the market price, the upper price, the lower price, and the closing price) and 56,174 rows. Also, we generatedsynthetic data with a zipf distribution (Z = 0.5). The synthetic data consists of 30,000 rows with 20 measures. As mentioned ear-

⁷ We purchased the stock data at Korea Exchange (http://www.krx.co.kr).


As mentioned earlier, a singular value in $\Sigma$ represents the importance of each dominant vector (i.e., $v_i$). Thus, if $\sigma_1$ is much larger than the other $\sigma_i$'s, the rows (i.e., multi-dimensional points which occur at different times) are located closely to $v_1$. It means that the data is highly correlated. $\sigma_1$ and $\sigma_2$ of the stock data are 338.3791429 and 2.753486056, and $\sigma_1$ and $\sigma_2$ of the synthetic data are 23.94858559 and 1.05308922. In the stock data, $\sigma_1$ is about 123 times larger than $\sigma_2$, while, in the synthetic data, $\sigma_1$ is about 23 times larger than $\sigma_2$. Thus, the rows of the stock data are more correlated than those of the synthetic data.

In order to demonstrate the accuracy of AMID, we use the root mean square percent error (RMSPE) for an $m \times n$ matrix $A$, which was used in SVDD [21]. If $a_{ij}$ is an original value, $\hat{a}_{ij}$ is the corresponding reconstructed value, and $\bar{a}$ is the mean value of the matrix $A$, RMSPE is defined as follows:

$$\mathrm{RMSPE} = \frac{\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}(a_{ij} - \hat{a}_{ij})^2}}{\sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}(a_{ij} - \bar{a})^2}} \times 100$$

6.2. The accuracy and execution time for static data

We first show the accuracy and execution time for static data. Fig. 8 presents the accuracy for the stock data and the synthetic data.

Fig. 8. The accuracy with static data. ((a) Stock data, (b) synthetic data; x-axis: space used (%), y-axis: RMSPE; series: SVD, SVDD, SVDW, AMID, EWA.)


In the result for the stock data shown in Fig. 8a, SVD is not applied when the storage space is less than 25% of the original data, since the stock data has four measures and keeping one column needs 25% of the total space. In this case, SVDD, SVDW, and AMID store some values instead of columns for the stock data in the total space. SVDD stores the largest values from the error matrix and SVDW stores the largest coefficients among the wavelet coefficients transformed from the error matrix. Note that in this case, the error matrix is the original matrix itself since no column of $U$ is stored. Meanwhile, AMID stores the largest coefficients among the wavelet coefficients transformed from $U$ instead of the original matrix. Since $U$ is more correlated than the original matrix, we can represent $U$ compactly with a small number of coefficients compared with $A$. Thus, AMID shows the best accuracy among all approaches when the storage space is less than 25% of the original data.

In the stock data, the first singular value is much greater than the others. Therefore, when the storage space is greater than 25% of the original data, the approaches based on SVD outperform extended wavelets. In addition, SVDD, SVDW, and AMID are more accurate than SVD.

In the result for the synthetic data shown in Fig. 8b, AMID is the most accurate among the SVD-based approaches. When the storage size is greater than 30% of the original data, EWA outperforms the other approaches since EWA can keep the largest wavelet coefficients for each measure. However, when the storage space is small, AMID shows the best accuracy.

Fig. 9 shows the execution time with varying space used. If $m \gg n$, the time complexity of SVD can be considered to be bounded by $O(m)$. Thus, SVD is faster than any other approach on both data sets.

[Figure: Time (s) vs. Space Used (%) for SVD, SVDD, SVDW, AMID, and EWA; (a) stock data, (b) synthetic data.]

Fig. 9. The performance with static data.


Since, in SVD, we compute $U$, $V$, and $\Sigma$ regardless of the storage space, the cost of dropping columns is negligible. Therefore, the execution time of SVD remains constant in spite of the increase of space. SVDW shows the worst performance since SVDW computes $U$, $V$, and $\Sigma$ using SVD, generates the error matrix, and applies wavelets to the error matrix.
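As a rough illustration (a minimal sketch under our own naming, not the authors' code), the SVDW pipeline can be written as three steps; the explicit rank-$k$ reconstruction needed for the error matrix is the step whose cost grows with the stored space.

    import numpy as np

    def svdw_error_matrix(A, k):
        # Step 1: thin SVD, whose cost is independent of the storage space.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        # Step 2: rank-k reconstruction from the k retained columns.
        A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
        # Step 3: the error matrix, which is then wavelet-transformed and
        # thresholded (e.g., with haar_1d from the earlier sketch).
        return A - A_k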

In Fig. 9, the execution times of most approaches are constant or increase as the storage space increases. However, the execution time of AMID decreases as the space increases: in AMID, as the storage space grows, the number of dropped column vectors decreases, and since the wavelet transformation is applied only to the dropped columns, less work remains. On the other hand, as the storage space increases, the time to reconstruct a matrix increases, and thus the time to construct the error matrix increases. Therefore, in contrast to AMID, the execution times of SVDD and SVDW increase.

6.3. The accuracy and execution time in incremental update environments

We conducted experiments to evaluate the two versions of AMID for incremental update environments. For the stock data, we consider 1-year stock data as a single snapshot and assume that a snapshot is stored 8 times, so that 8 years of stock data are stored in total. To measure the effect of data loss, we let the total disk space be the size of the 1-year stock data. For the synthetic data, we generated 10 snapshots; a snapshot is a $3000 \times 20$ matrix and the maximum storage space is $3000 \times 20$.

[Figure: RMSPE vs. the number of snapshots for IncSVD, LocAMID, and IncAMID; (a) stock data (1–8 snapshots), (b) synthetic data (1–10 snapshots).]

Fig. 10. The accuracy with incremental updates.


Fig. 10 shows the accuracy as the number of snapshots to be stored increases. When the number of snapshots is 1, the RMSPE is 0 since the whole data can be stored without loss of information. However, as shown in Fig. 10, the RMSPE increases as the number of snapshots increases, since more and more data is compressed into the fixed storage space.

In the experimental result with the stock data shown in Fig. 10a, the accuracy of LocAMID is somewhat worse than those of IncSVD and IncAMID (within a 5% gap). As mentioned earlier, the stock data is highly correlated. Since LocAMID applies SVD to each snapshot separately, it does not exploit the correlations across snapshots. Thus, LocAMID does not show good accuracy in Fig. 10a.

In contrast, the synthetic data is less correlated. Thus, in the experimental result with the synthetic data shown in Fig. 10b, LocAMID shows the best accuracy (within a 5% gap).

In addition, as mentioned above, the size of the disk space is equal to the size of the 1-year stock data, and the stock data consists of four measures, so the storage budget can hold four dominant column vectors in their entirety. Thus, when the number of snapshots is less than 5, the dominant columns are kept entirely. When the number of snapshots for the stock data reaches 5, not all entries in the dominant columns can be recorded. Therefore, in Fig. 10a, the RMSPE increases sharply when the number of snapshots changes from 4 to 5.

Fig. 11 shows the execution time for the stock data and the synthetic data. As expected, LocAMID shows the fastest execution time on both data sets, and its execution time is nearly constant. IncSVD shows the worst running time since it requires the reconstruction of a matrix.

[Figure: Time (s) vs. the number of snapshots for IncSVD, LocAMID, and IncAMID; (a) stock data, (b) synthetic data.]

Fig. 11. The performance with incremental updates.


Consequently, for highly correlated data, Incremental AMID shows the best accuracy, while Local AMID achieves the best performance over diverse data sets with reasonable accuracy.

7. Conclusion

In this paper, we propose AMID (Approximation of MultI-measured Data using SVD) to approximate data with multiple measures.

Previous techniques for multi-measured data are based on wavelets. In contrast, AMID adopts the singular value decomposition (SVD) as the dominant compressor. Using mathematical analysis, we derive the error bound of an individual data value, which is not provided by the wavelet techniques minimizing the $L_2$ norm error.

In AMID, multi-measured data is decomposed into a column-orthogonal matrix, an orthogonal matrix, and a diagonal matrix that contains the singular values. According to the singular values, some column vectors are dropped. In order to fully utilize the storage space and improve the accuracy, a wavelet transformation is applied to the dropped column vectors and, according to their weights, some wavelet coefficients are retained.

For incremental update environments, we devise Incremental AMID and Local AMID, which do not require the reconstruction of previously archived data.

We implemented AMID and diverse approaches for static data, as well as Incremental AMID and Local AMID for incremental update environments, and conducted extensive experiments with real-life data and synthetic data. The experimental results show that AMID is effective compared to the other approaches. In the experiments for incremental update environments, Local AMID is superior to the other approaches in terms of performance, and Incremental AMID shows the best accuracy when the data is correlated.

Acknowledgments

This work was partially supported by the Defense Acquisition Program Administration and the Agency for Defense Development under the contract.

References

[1] P.G. Brown, P.J. Haas, Techniques for warehousing of sample data, in: Proc. of IEEE ICDE, 2006, p. 6.
[2] G. Cormode, M. Garofalakis, D. Sacharidis, Fast approximate wavelet tracking on streams, in: Proc. of EDBT Conf., 2006, pp. 4–22.
[3] A. Deligiannakis, N. Roussopoulos, Extended wavelets for multiple measures, in: Proc. of ACM SIGMOD Conf., 2003, pp. 229–240.
[4] C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (3) (1936) 211–218.
[5] D. Fuchs, Z. He, B.S. Lee, Compressed histograms with arbitrary bucket layouts for selectivity estimation, Information Sciences 177 (3) (2007) 680–702.
[6] M. Garofalakis, P.B. Gibbons, Wavelet synopses with error guarantees, in: Proc. of ACM SIGMOD Conf., 2002, pp. 476–487.
[7] M. Garofalakis, P.B. Gibbons, Probabilistic wavelet synopses, ACM Transactions on Database Systems (TODS) 29 (1) (2004) 43–90.
[8] M. Garofalakis, A. Kumar, Deterministic wavelet thresholding for maximum-error metrics, in: Proc. of PODS, 2004, pp. 166–176.
[9] P.B. Gibbons, Y. Matias, New sampling-based summary statistics for improving approximate query answers, in: Proc. of ACM SIGMOD Conf., 1998, pp. 331–342.
[10] P.B. Gibbons, Distinct sampling for highly-accurate answers to distinct values queries and event reports, in: Proc. of VLDB Conf., 2001, pp. 541–550.
[11] G.H. Golub, C.F.V. Loan, Matrix Computations, Johns Hopkins U. Press, 1996.
[12] S. Guha, B. Harb, Wavelet synopsis for data streams: minimizing non-Euclidean error, in: Proc. of ACM SIGKDD Conf., 2005, pp. 88–97.
[13] S. Guha, B. Harb, Approximation algorithms for wavelet transform coding of data streams, in: Proc. of SODA Conf., 2006, pp. 698–707.
[14] S. Guha, C. Kim, K. Shim, XWAVE: optimal and approximate extended wavelets for streaming data, in: Proc. of VLDB Conf., 2004, pp. 288–299.
[15] S. Guha, Space efficiency in synopsis construction algorithms, in: Proc. of VLDB Conf., 2005, pp. 409–420.
[16] J.M. Hellerstein, P.J. Haas, H.J. Wang, Online aggregation, in: Proc. of ACM SIGMOD Conf., 1997, pp. 171–182.
[17] Y. Ioannidis, V. Poosala, Balancing optimality and practicality for query result size estimation, in: Proc. of ACM SIGMOD Conf., 1995, pp. 233–244.
[18] H.V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K.C. Sevcik, T. Suel, Optimal histograms with quality guarantees, in: Proc. of VLDB Conf., 1998, pp. 275–286.
[19] P. Karras, N. Mamoulis, One-pass wavelet synopses for maximum-error metrics, in: Proc. of VLDB Conf., 2005, pp. 421–432.
[20] P. Karras, N. Mamoulis, The Haar+ tree: a refined synopsis data structure, in: Proc. of ICDE Conf., 2007, pp. 436–445.
[21] F. Korn, H.V. Jagadish, C. Faloutsos, Efficiently supporting ad hoc queries in large datasets of time sequences, in: Proc. of ACM SIGMOD Conf., 1997, pp. 289–300.
[22] D.C. Lay, Linear Algebra and its Applications, Addison Wesley Longman, 1999.
[23] Y. Matias, J.S. Vitter, M. Wang, Wavelet-based histograms for selectivity estimation, in: Proc. of ACM SIGMOD Conf., 1998, pp. 448–459.
[24] G. Piatetsky-Shapiro, C. Connell, Accurate estimation of the number of tuples satisfying a condition, in: Proc. of ACM SIGMOD Conf., 1984, pp. 256–276.
[25] V. Poosala, Y.E. Ioannidis, Selectivity estimation without the attribute value independence assumption, in: Proc. of VLDB Conf., 1997, pp. 486–495.
[26] R. Saint-Paul, G. Raschia, N. Mouaddib, General purpose database summarization, in: Proc. of VLDB Conf., 2005, pp. 733–744.
[27] E.J. Stollnitz, T.D. DeRose, D.H. Salesin, Wavelets for Computer Graphics, Morgan Kaufmann, 1996.
[28] J.S. Vitter, Random sampling with a reservoir, ACM Transactions on Mathematical Software (TOMS) 11 (1) (1985) 37–57.
[29] R.R. Yager, D. Filev, Summarizing data using a similarity based mountain method, Information Sciences 178 (3) (2008) 816–826.
[30] V.Y. Pan, Z.Q. Chen, The complexity of the matrix eigenproblem, in: Proc. of STOC Conf., 1999, pp. 507–516.