F AST A PPROXIMATE C ORRELATION FOR M ASSIVE T IME - SERIES D ATA SIGMOD’10 Abdullah Mueen, Suman...

FAST APPROXIMATE CORRELATION FOR MASSIVE TIME-SERIES DATA

SIGMOD’10

Abdullah Mueen, Suman Nath, Jie Liu

OUTLINE

Introduce Motivation Reducing I/O Costs Reducing Computational Cost Evaluation Conclusion

INTRODUCE

The increasing instrumentation of physical and computing processes has given us unprecedented capabilities to collect massive volumes of data.

Data center management,Environmental monitoring ,financial engineering Stream warehousing systems (SWS)

Minimizing response times of ad-hoc queries in an SWS is very important for effective user interaction.

(CONT.)

A typical SWS, such as DataGarage, monitors multiple data centers, each of which may contain tens of thousands of servers.

In this paper, we consider a more complex query of computing a correlation matrix.

(I)n signals of equal length m(II)compute a nxn matrix C(III) such that C[i; j]; 1 <=i, j <= n is

the correlation coefficient corr(i , j) of signals i and j

PEARSON’S CORRELATION COEFFICIENTS

(a) Correlation Matrix C

Figure 1: Correlation matrices. Darker pixels represent higher correlation coefficients (e.g., a black pixel represents a corr. coeff. of 1).

(CONT.)

Fast computation of a large correlation matrix in this setting is challenging because of its high I/O and CPU overhead. High I/O costs High CPU costs

To address the above challenges, we consider a slightly relaxed, yet almost equally useful, version of the problem

(CONT.)

Threshold Correlation Matrix

(I)n signals of equal length m(II)compute a nxn matrix C(III) such that C[i; j] ; 1 <=i, j <= n

is the correlation coefficient corr(i , j) of signals i and j.

(b) Threshold Corr. Matrix CT

(a) Correlation Matrix C

Figure 1: Correlation matrices. Darker pixels represent higher correlation coefficients (e.g., a black pixel represents a corr. coeff. of 1).

MOTIVATION

REDUCING I/O COSTS

To reduce I/O costs

we propose a novel data partitioning algorithm that divides a given dataset into batches such that each batch fits in the available memory

REDUCING I/O COSTS

A naïve algorithm would require computing correlation of all pairs of n signals and have an O( ) I/O cost.

We can reduce the cost in at least two ways. Pruning：

How does the algorithm know which signals are correlated?

Intelligent Caching：How can one find a good caching strategy to minimize I/O costs?

(CONT.)IDENTIFYING CORRELATED PAIRS

We use Discrete Fourier Transform (DFT) to identify correlated signal pairs in an I/O efficient manner.

As the following lemma suggests, the correlation coefficient of signals can be reduced to the Euclidean distance between their normalized series.

(CONT.)LEMMA 1 & LEMMA 2

Lemma 1 ：

Lemma 2(higher than threshold) ：

(CONT.)

By ignoring such pairs, we will get a set of likely correlated signal pairs.

This is a superset of the correlated signal pairs, but there will be no false negatives.

In real-world, the first few low frequency DFT coefficients are sufficient to capture the overall shape of a signal.

Since k << m, we can maintain the coefficients in cache.

(CONT.)CACHING STRATEGY

(CONT.)

REDUCING CPU COSTS

Two novel approximation algorithms.

First approximation algorithm computes correlation coefficients within a given error bound

Second algorithm identifies signal pairs with correlation above a threshold without any false positives or false negatives.

(CONT.)TWO FACTS

It is often sufficient to report correlation coefficients within a small error bound.

It may often be sufficient to only identify all and only the pairs with correlation above a given threshold; knowing the exact correlation coefficients is not strictly necessary.

(CONT.) APPROXIMATE THRESHOLD CORRELATIONMATRIX

(I)n signals of equal length m ,a threshold T,and an error bound

(II)compute a nxn matrix C(III) such that C[i; j]; 1 <=i, j <= n is the

correlation coefficient corr(i , j) of signals i and j

(CONT.)

Figure 1(b)

compare

(CONT.) APPROXIMATION WITH PREFIX DISTANCE

Thus, one way to approximate corr(x; y) is to use an approximate value for d(bx;by ); e.g., dk(bx;by), for k < m.

For significant savings in computational complexity, it is important that k << m.

Unfortunately such approximation does not work well in the time domain.

(CONT.)

We therefore approximate distance in the frequency domain.

Since DFT is a linear transformation, the Euclidean distance is preserved under the transformation.

Therefore This allows us to rephrase Lemma 1 as

follows:

(CONT.)

Thus, to minimize overall computation cost while guaranteeing a given approximation error bound, we need to compute the smallest number k of DFT coefficients that ensure the target accuracy.

Our main result in this section shows a relationship between the approximation error in a correlation coefficient and the number of leading DFT coefficients used for approximating the coefficient

(CONT.)ENERGY OF A SIGNAL

The energy of a signal x is defined as

Lemma 3

(CONT.)LEMMA 4

(CONT.)

THRESHOLD BOOLEAN CORRELATION MATRIX

(I)n signals of equal length m ,a threshold T(II)Compute a nxn matrix C

• The second type of approximation we consider is to identify if a pair of signals has correlation above a given threshold T.

(CONT.)

Figure 1(b)

compare

(CONT.)LEMMA 5

The proof of the lemma follows Lemma 2. Obviously, the tighter the upper and the lower bounds are, the more likely it is to determine the correlation status of a pair.

how do we efficiently find good bounds on distances of two signals?

(CONT.)LEMMA 6

Unfortunately, apart from being computationally expensive, the resulting upper bounds are not tight and are practically useless because a random reference signal is expected to be equally far from all of the signals in a very high (= m) dimensional space.

(CONT.) A DYNAMIC PROGRAMMING ALGORITHM

Ideally, for a pair of signal x and y, we want a third signal v (called a verifier) that is either

very close to both x and y implying that x and y are likely to be close, and hence correlated to each other

very close to x but not to y (or vice versa), implying that x and y are not correlated.

EVALUATION

CONCLUSION

We have proposed novel algorithms, based on Discrete Fourier Transform (DFT) and graph partitioning, to reduce the end-to-end response time of an all-pair correlation query.

F AST A PPROXIMATE C ORRELATION FOR M ASSIVE T IME - SERIES D ATA SIGMOD’10 Abdullah Mueen, Suman...

Documents

Transcript of F AST A PPROXIMATE C ORRELATION FOR M ASSIVE T IME - SERIES D ATA SIGMOD’10 Abdullah Mueen, Suman...

TACO: T unable A pproximate C omputation of O utliers in Wireless Sensor Networks

C AUSAL I NFERENCE FROM D ESCRIPTIONS OF E XPERIMENTAL AND N ON - EXPERIMENTAL R ESEARCH : P UBLIC U NDERSTANDING OF C ORRELATION -V ERSUS -C AUSATION.

NMR Spectroscopy CHEM 430. P ROTON -P ROTON C ORRELATION T HROUGH J-C OUPLING 2D NMR Basics. In actuality, the techniques we have already covered 1 H,

Per Su Assive

NTEGRATING ASSIVE ONTROLS IN THE E R P · alternative passive air pollution mitigations measures, those hidden in plain sight?! gallagher et al., 2019 (in preparation) air pollution

Semiclassical C orrelation in D ensity-Matrix Dynamics

P assive Power Filters - CERN Document Server · 2015-07-29 · P assive Power Filters R. Künzi Paul Scherrer Institute, Villigen, Switzerland . Abstract . Power converters require

M assive O pen O nline S upport for E ducation

Optical Illusion / Optical Truth - Epsilon Theory › wp-content › uploads › epsilon... · 2020-05-03 · “Optical Illusion / Optical Truth” Edward Tufte, “ orrelation

S EMINAR ON P ASSIVE & A CTIVE D ESIGN FOR E NERGY E FFICIENT B UILDINGS 3 October 2014 Holiday Inn Resort, Penang Part 1.

EVERY BODY WANTS P assive Income (ROYALTY) Healthy Bank Balance Luxurious Car World Tour Big House " Together we achieve that which.

Passive Damping Mechanism of Herschel-Quincke …...Master of Applied Science in Mechanical Engineering Thesis title: P assive Damping Mechanism of Herschel -Qu incke Tubes for P ressure

Revolutionising the way you trade commodities. …...The commodities trading market is the biggest trade market on tigitalisation assive amounts of oil, metals, agricultural products

Class-X G RAMMAR U NDERSTANDING A CTIVE AND P ASSIVE V OICE The Path to Effective Writing.

LEM Residential MetaData Summary Appendix : orrelation Plots › media › 4639 › lem-residential-metadata-sum… · this process, there are a significant number of plots included

I NFINITIVES S EEM + INFINITIVE P ASSIVE + INFINITIVE Jelena Basta e-mail: jelena.basta@eknfak.ni.ac.rsjelena.basta@eknfak.ni.ac.rs.

Half-Assive or Fully-Passive?

ASSIVE POWER 1 Game Modes - Arena: the Contestarenathecontest.com/wp-content/uploads/docs/arena_quickstart_v12... · - Check the Quest Guide and Campaign Tome for more information

I NCREMENTAL M AINTENANCE OF L ENGTH N ORMALIZED I NDEXES FOR A PPROXIMATE S TRING M ATCHING - Ashwin Joshi 1.

UTAH LEAGUE OF CITIES & TOWNS BOARD OF DIRECTORS …site.utah.gov/.../sites/4/2017/06/ULCT-Board-Materials-December-9-2… · 9, 2016 @ 9:00 AM (T. IMES ARE . A. PPROXIMATE) 1. Welcome