The Complexity of Massive Data Set Computations


1

The Complexity of Massive Data Set Computations

Ziv Bar-Yossef

Computer Science Division

U.C. Berkeley

Ph.D. Dissertation Talk

May 6, 2002

2

What Are Massive Data Sets?

Examples
• The Web
• IP packets
• Supermarket transactions
• Telephone call graph
• Astronomical observations

Characterizing properties
• Huge collections of raw data
• Data is generated and modified continuously
• Distributed over many sites
• Slow storage devices
• Data is not organized / indexed

3

Nontraditional Computational Challenges

Restricted access to the data
• Random access: expensive
• "Streaming" access: more feasible
• Some data may be unavailable
• Fetching data is expensive

Traditionally: cope with the difficulty of the problem

Massive data sets: cope with the size of the data and the restricted access to it

Sub-linear running time
• Ideally, independent of data size

Sub-linear space
• Ideally, logarithmic in data size

4

Basic Framework

Massive data set computations are typically:
• Approximate
• Randomized
• Have a restricted access regime

[Diagram: input data → access regime → algorithm (with randomness $$) → approximate output]

($$ = randomness)

5

Prominent Computational Models for Massive Data Sets

• Sampling Computations
  – Sub-linear running time & space
  – Suitable for "insensitive" functions

• Data Stream Computations
  – Linear running time, sub-linear space
  – Can compute sensitive functions

• Sketch Computations
  – Suitable for distributed data

6

Sampling Computations

[Diagram: the sampling algorithm queries the input x1, x2, …, xn at random locations, using randomness $$, and outputs an approximation of f(x1,…,xn)]

• Queries the input at random locations
• Can choose the query distribution and can query adaptively
• Complexity measure: query complexity

• Applications
  – Statistical parameter estimation (a toy sketch follows below)
  – Computational and statistical learning [Valiant 84, Vapnik 98]
  – Property testing [RS96, GGR96]
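For concreteness (this sketch is ours, not part of the original slides), here is a minimal sampling computation in this model: estimating the mean of x1,…,xn ∈ [0,1] to additive error ε with failure probability δ by querying uniformly random locations. The function name `sample_mean` and the Hoeffding-style choice of the sample size are our own illustration.

```python
import math
import random

def sample_mean(query, n, eps, delta):
    """(eps, delta)-approximate the mean of x_1..x_n, each in [0,1].

    query(i) returns x_i.  Only O((1/eps^2) * log(1/delta)) uniformly
    random queries are made -- independent of the data size n.
    """
    k = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))  # Hoeffding bound
    total = sum(query(random.randrange(n)) for _ in range(k))
    return total / k

# Toy data set with a 0.7 fraction of ones.
data = [1.0] * 700 + [0.0] * 300
print(sample_mean(lambda i: data[i], len(data), eps=0.05, delta=0.01))
```

The Ω((1/ε²)·log(1/δ)) lower bound of slide 14 shows that this query count is essentially optimal.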

7

Data Stream Computations [HRR98, AMS96, FKSV99]

[Diagram: the input x1 x2 x3 … xn arrives as a one-way stream; the data stream algorithm uses randomness $$ and limited memory, and outputs an approximation of f(x1,…,xn)]

• Input arrives in a one-way stream, in arbitrary order
• Complexity measures: space and time per data item

• Applications
  – Databases (frequency moments [AMS96]; a toy sketch follows below)
  – Networking (Lp distance [AMS96, FKSV99, FS00, Indyk 00])
  – Web information retrieval (Web crawling, Google query logs [CCF02])
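As an illustration of the model (ours, not from the slides), here is a minimal one-pass estimator for the second frequency moment F2 in the spirit of [AMS96]: each counter accumulates a random ±1 sign per distinct item, and each counter's square has expectation F2. The class name and the lazily memoized signs are our simplification; the construction in [AMS96] uses 4-wise independent hash functions to keep the space small.

```python
import random

class F2Sketch:
    """One-pass estimator for F2 = sum_j (f_j)^2, in the spirit of [AMS96].

    Each of `reps` counters accumulates a random +/-1 sign per distinct
    item; E[counter^2] = F2.  Signs are memoized in a dict for brevity.
    """
    def __init__(self, reps=64, seed=0):
        self.rng = random.Random(seed)
        self.signs = [dict() for _ in range(reps)]
        self.counters = [0] * reps

    def update(self, item):
        for r in range(len(self.counters)):
            s = self.signs[r].setdefault(item, self.rng.choice((-1, 1)))
            self.counters[r] += s

    def estimate(self):
        # Average the squared counters to reduce the variance.
        return sum(z * z for z in self.counters) / len(self.counters)

sketch = F2Sketch()
for a in [1, 2, 2, 3, 3, 3]:          # f_1 = 1, f_2 = 2, f_3 = 3, so F2 = 14
    sketch.update(a)
print(sketch.estimate())              # close to 14 in expectation
```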

8

Sketch Computations [GM98, BCFM98, FKSV99]

[Diagram: the data is distributed over t sites, site i holding xi1 … xik; each site compresses its data into a short sketch (using shared randomness $$), and the sketch algorithm computes an approximation of f(x11,…,xtk) from the sketches]

• The algorithm computes from data "sketches" sent from the sites
• Complexity measure: sketch lengths
• Applications
  – Web information retrieval (identifying document similarities [BCFM98]; a toy sketch follows below)
  – Networking (Lp distance [FKSV99])
  – Lossy compression, approximate nearest neighbor
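As an illustration of a sketch computation (ours, not from the slides), here is a minimal min-hash sketch in the spirit of the document-similarity application [BCFM98]: each site condenses its set of features into a fixed-length sketch, and the resemblance (Jaccard similarity) of two sets is estimated from the sketches alone. The salted SHA-1 below merely stands in for the min-wise independent permutations used in [BCFM98].

```python
import hashlib

def minhash_sketch(features, num_hashes=128):
    """Sketch a set of string features: one minimum hash value per 'permutation'.

    The sketch length is num_hashes values, independent of |features|.
    """
    sketch = []
    for salt in range(num_hashes):
        sketch.append(min(
            hashlib.sha1(f"{salt}:{x}".encode()).hexdigest() for x in features))
    return sketch

def estimated_resemblance(sk_a, sk_b):
    """Fraction of agreeing coordinates ~ Jaccard similarity of the two sets."""
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)

doc1 = {"the", "web", "is", "a", "massive", "data", "set"}
doc2 = {"the", "web", "is", "a", "huge", "data", "set"}
print(estimated_resemblance(minhash_sketch(doc1), minhash_sketch(doc2)))  # ~0.75
```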

9

Main Objective

Explore the limitations of the above computational models:

• Develop general lower bound techniques

• Obtain lower bounds for specific functions

10

Thesis Blueprint

[Diagram: tools from information theory, statistical decision theory, and communication complexity feed into the three models:
• Sampling computations – lower bounds for general functions [BKS01, B02]
• Data stream computations – reduction from one-way CC
• Sketch computations – reduction from simultaneous CC
• One-way and simultaneous CC lower bounds [BJKS02a]
• General CC lower bounds [BJKS02b]]

11

Sampling Lower Bounds (with R. Kumar and D. Sivakumar, STOC 2001, and Manuscript, 2002)

• Combinatorial lower bound [BKS01]
  – bounds the expected query complexity of every function
  – tends to be weak
  – based on a generalization of Boolean block sensitivity [Nisan 89]

• Statistical lower bounds
  – bound the query complexity of symmetric functions
  – via Hellinger distance: worst-case query complexity [BKS01]
  – via KL distance: expected query complexity [B02]
  – tend to be tight
  – work by a reduction from statistical hypothesis testing

• Information theory lower bound [B02]
  – bounds the worst-case query complexity of symmetric functions
  – has better dependence on the domain size

12

Main Idea

(ε,δ)-approximation: Pr(T(x) ∈ C(x)) ≥ 1 − δ, where T(x) is the algorithm's output on input x and C(x) is the approximation set of x (the set of acceptable outputs).

Inputs x, y are ε-disjoint if their approximation sets are disjoint: C(x) ∩ C(y) = ∅.

Main observation:
Since T(x) ∈ C(x) w.p. 1 − δ for all x, then:

x, y ε-disjoint ⟹ T(x), T(y) are "far" from each other

[Diagram: disjoint approximation sets C(x), C(y), C(w) of ε-disjoint inputs x, y, w]

13

Main Result

Theorem
For any symmetric f, any ε-disjoint inputs x, y, and any algorithm that (ε,δ)-approximates f:
• Worst-case # of queries ≥ Ω( (1/h²(Ux,Uy)) · log(1/δ) )
• Expected # of queries ≥ Ω( (1/KL(Ux,Uy)) · log(1/δ) )

• Ux – uniform query distribution on x (induced by: pick i u.a.r., output xi)
• Hellinger: h²(Ux,Uy) = 1 − Σa (Ux(a) · Uy(a))^½
• KL: KL(Ux,Uy) = Σa Ux(a) · log(Ux(a) / Uy(a))
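To make Ux, h², and KL concrete (this snippet is ours, not part of the slides), the following computes the two distances for small explicit inputs; `query_distribution` simply tabulates value frequencies.

```python
import math
from collections import Counter

def query_distribution(x):
    """Ux: pick a uniformly random index i and output x_i."""
    counts = Counter(x)
    return {a: c / len(x) for a, c in counts.items()}

def hellinger_sq(p, q):
    """h^2(P,Q) = 1 - sum_a sqrt(P(a) * Q(a))."""
    support = set(p) | set(q)
    return 1.0 - sum(math.sqrt(p.get(a, 0.0) * q.get(a, 0.0)) for a in support)

def kl(p, q):
    """KL(P||Q) = sum_a P(a) * log(P(a)/Q(a)) (assumes supp(P) inside supp(Q))."""
    return sum(pa * math.log(pa / q[a]) for a, pa in p.items() if pa > 0)

eps = 0.1
x = [1] * 60 + [0] * 40        # a (1/2 + eps) fraction of ones
y = [1] * 40 + [0] * 60        # a (1/2 - eps) fraction of ones
Ux, Uy = query_distribution(x), query_distribution(y)
print(hellinger_sq(Ux, Uy), kl(Ux, Uy))   # both are O(eps^2)
```

With ε = 0.1 both outputs are Θ(ε²), matching the mean example on the next slide.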

14

Example: Mean

x: a (½ + ε) fraction of 1's and a (½ − ε) fraction of 0's
y: a (½ − ε) fraction of 1's and a (½ + ε) fraction of 0's

h²(Ux,Uy) = KL(Ux,Uy) = O(ε²)

Theorem (originally, [CEG95])
Approximating the mean of n numbers in [0,1] to within additive error ε requires Ω((1/ε²) · log(1/δ)) queries.

Other applications: Selection functions, frequency moments, extractors and dispersers
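A short worked calculation (ours, filling in the O(ε²) claim): Ux puts mass ½ + ε on 1 and ½ − ε on 0, and Uy swaps the two masses, so

```latex
\begin{aligned}
h^2(U_x,U_y) &= 1 - 2\sqrt{\bigl(\tfrac12+\varepsilon\bigr)\bigl(\tfrac12-\varepsilon\bigr)}
              = 1 - \sqrt{1-4\varepsilon^2} \;\approx\; 2\varepsilon^2 ,\\
\mathrm{KL}(U_x,U_y) &= 2\varepsilon \,\ln\frac{\tfrac12+\varepsilon}{\tfrac12-\varepsilon}
              \;\approx\; 8\varepsilon^2 .
\end{aligned}
```

Both distances are Θ(ε²), so the main theorem yields the Ω((1/ε²) · log(1/δ)) query bound stated above.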

15

Proof Outline

1. For symmetric functions, WLOG, all queries are uniform without replacement.

2. If the # of queries is ≤ n^½, we can further assume queries are uniform with replacement.

3. For any ε-disjoint inputs x, y:
   an (ε,δ)-approximation of f with k queries ⟹ a hypothesis test of Ux against Uy with error δ and k samples.

4. Hypothesis testing lower bounds
   • via Hellinger distance (worst-case)
   • via KL distance (expected) (cf. [Siegmund 85])

16

Statistical Hypothesis Testing

[Diagram: a black box contains one of two known distributions P, Q; the hypothesis test receives k i.i.d. samples from the box]

• The black box contains either P or Q
• The test has to decide: "P" or "Q"
• Allowed error probability: δ
• Goal: minimize k

17

Sampling Algorithm ⟹ Hypothesis Test

x, y: ε-disjoint inputs

[Diagram: the black box contains either Ux or Uy; the sampling algorithm is run on the k i.i.d. samples drawn from the box]

The test outputs:
• "Ux" – if the algorithm's output lands in C(x) (which is disjoint from C(y))
• "Uy" – otherwise

18

Lower Bound via Hellinger Distance

A hypothesis test for Ux against Uy with error δ and k samples implies:
V(Ux^k, Uy^k) ≥ 1 − 2δ

Lemma (cf. [Le Cam, Yang 90])
V²(Ux^k, Uy^k) ≤ h²(Ux^k, Uy^k) · (2 − h²(Ux^k, Uy^k))
1 − h²(Ux^k, Uy^k) = (1 − h²(Ux, Uy))^k

Corollary: k ≥ Ω( (1/h²(Ux,Uy)) · log(1/δ) )
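Chaining these facts makes the corollary explicit (this derivation is our gloss on the slide; write h² = h²(Ux,Uy) and h_k² = h²(Ux^k,Uy^k)):

```latex
(1-2\delta)^2 \;\le\; V^2(U_x^k,U_y^k)
  \;\le\; h_k^2\,(2-h_k^2)
  \;=\; 1-\bigl(1-h^2\bigr)^{2k},
```

hence (1 − h²)^{2k} ≤ 4δ(1 − δ) ≤ 4δ, and since −ln(1 − h²) = Θ(h²) whenever h² is bounded away from 1, this forces k = Ω(log(1/δ) / h²).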

19

Communication Complexity [Yao 79]

f: X × Y → Z

[Diagram: Alice holds x ∈ X, Bob holds y ∈ Y; using private randomness $$ they exchange messages and output f(x,y)]

Rδ(f) = randomized CC of f with error δ

20

Multi-Party Communication

f: X1 × … × Xt → Z

[Diagram: players P1, P2, P3, …, Pt hold inputs x1, x2, x3, …, xt respectively and communicate to compute f(x1,…,xt)]

21

Example: Set-Disjointness

t-party set-disjointness: player Pi gets Si ⊆ [n], and

Disjt(S1,…,St) = 0 if Si ∩ Sj = ∅ for all i ≠ j,  1 if |∩i Si| = 1

Theorem [KS87, R90]: Rδ(Disj2) = Ω(n)

Theorem [AMS96]: Rδ(Disjt) = Ω(n/t⁴)

Best upper bound: Rδ(Disjt) = O(n/t)

22

Restricted Communication Models

One-Way Communication [PS84, Ablayev 93, KNR95]
[Diagram: P1 → P2 → … → Pt; the last player outputs f(x1,…,xt)]
• Reduces to data stream computations

Simultaneous Communication [Yao 79]
[Diagram: P1, P2, …, Pt each send a single message to a referee, who outputs f(x1,…,xt)]
• Reduces to sketch computations

23

Example: Disjointness ⟹ Frequency Moments

Input stream: a1,…,am ∈ [n]; for j ∈ [n], fj = # of occurrences of j in a1,…,am

k-th frequency moment: Fk(a1,…,am) = Σj∈[n] (fj)^k

Theorem [AMS96]: Disjt reduces to Fk, for t = n^(1/k)

Corollary: DS(Fk) = n^Ω(1), k > 5

Best upper bounds: DS(Fk) = n^O(1), k > 2
                   DS(Fk) = O(log n), k = 0, 1, 2
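To pin down the definition (this snippet is ours), Fk can be computed exactly with one counter per distinct element; this Θ(n)-space baseline is precisely what sub-linear-space streaming algorithms try to beat, and what the n^Ω(1) lower bounds above constrain for larger k.

```python
from collections import Counter

def frequency_moment(stream, k):
    """F_k(a_1,...,a_m) = sum over j of (f_j)^k, computed exactly.

    Uses one counter per distinct element, i.e. Theta(n) space in the
    worst case -- the cost that streaming algorithms aim to avoid.
    """
    freq = Counter(stream)
    return sum(f ** k for f in freq.values())

stream = [1, 2, 2, 3, 3, 3]
print(frequency_moment(stream, 0),   # F_0 = 3 (number of distinct elements)
      frequency_moment(stream, 1),   # F_1 = 6 (stream length)
      frequency_moment(stream, 2))   # F_2 = 14
```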

24

Information Statistics Approach to Communication Complexity

(with T.S. Jayram, R. Kumar, and D. Sivakumar, Manuscript 2002)

A novel lower bound technique for randomized CC based on statistics and information theory

Applications
• General CC lower bounds
  – t-party set-disjointness: Ω(n/t²) (improving on [AMS96])
  – Lp (solving an open problem of [Saks-Sun 02])
  – Inner product
• One-way CC lower bounds
  – t-party set-disjointness: Ω(n/t^(1+ε)) for any ε > 0
• Space lower bounds in the data stream model
  – frequency moments: n^Ω(1), k > 2 (proving a conjecture of [AMS96])
  – Lp distance

25

Statistical View of Communication Complexity

Π – a δ-error randomized protocol for f: X × Y → Z
Π(x,y) – the distribution over transcripts of Π on input (x,y)

Lemma: For any two input pairs (x,y), (x',y') with f(x,y) ≠ f(x',y'),
V(Π(x,y), Π(x',y')) ≥ 1 − 2δ

Proof: By reduction from hypothesis testing.

Corollary: h²(Π(x,y), Π(x',y')) ≥ 1 − (2δ)^½

26

Information Cost [Ablayev 93, Chakrabarti et al. 01, Saks-Sun 02]

For a protocol Π that computes f, how much information does the transcript Π(x,y) have to reveal about (x,y)?

μ = (X,Y) – a distribution over the inputs of f

Definition: δ-information cost
icostμ(Π) = I(X,Y ; Π(X,Y))
icostμ,δ(f) = min{icostμ(Π)}, over all δ-error protocols Π for f

I(X,Y ; Π(X,Y)) ≤ H(Π(X,Y)) ≤ |Π(X,Y)|

⟹ an information cost lower bound gives a CC lower bound

27

Direct Sum for Information Cost

Decomposable functions:
f(x,y) = g(h(x1,y1),…,h(xn,yn)),
where h: Xi × Yi → {0,1} and g: {0,1}^n → {0,1}

Example: set-disjointness  Disj2(x,y) = (x1 Λ y1) V … V (xn Λ yn)

Theorem (direct sum): For appropriately chosen μ, μ',
icostμ,δ(f) ≥ n · icostμ',δ(h)

⟹ a lower bound on icostμ',δ(h) gives a lower bound on icostμ,δ(f)

28

Information Cost of Single-Bit Functions

For Disj2, μ' = ½ μ'1 + ½ μ'2, where:
μ'1 = ½(1,0) + ½(0,0),  μ'2 = ½(0,1) + ½(0,0)

Lemma 1: For any protocol Π for AND,
icostμ'(Π) ≥ Ω(h²(Π(0,1), Π(1,0)))

Lemma 2: h²(Π(0,1), Π(1,0)) = h²(Π(1,1), Π(0,0))

Corollary 1: icostμ',δ(AND) ≥ Ω(1 − (2δ)^½)

Corollary 2: icostμ,δ(Disj2) ≥ Ω(n · (1 − (2δ)^½))

29

Proof of Lemma 2

"Rectangle" property of deterministic protocols:
For any transcript τ, the set of all (x,y) with Π(x,y) = τ is a "combinatorial rectangle" S × T, where S ⊆ X and T ⊆ Y

"Rectangle" property of randomized protocols:
For all x ∈ X, y ∈ Y, there exist functions px: {0,1}* → [0,1] and qy: {0,1}* → [0,1], such that for any possible transcript τ,
Pr(Π(x,y) = τ) = px(τ) · qy(τ)

Therefore:
h²(Π(0,1), Π(1,0)) = 1 − Στ (Pr(Π(0,1) = τ) · Pr(Π(1,0) = τ))^½
                   = 1 − Στ (p0(τ) · q1(τ) · p1(τ) · q0(τ))^½ = h²(Π(0,0), Π(1,1))

30

Conclusions

• Studied limitations of computing on massive data sets
  – Sampling computations
  – Data stream computations
  – Sketch computations

• Lower bound methodologies are based on
  – Information theory
  – Statistical decision theory
  – Communication complexity

• Lower bound techniques:
  – Reveal novel aspects of the models
  – Present a "template" for obtaining specific lower bounds

31

Open Problems

• Sampling
  – Lower bounds for non-symmetric functions
  – Property testing lower bounds

• Communication complexity
  – Study the communication complexity of approximations
  – Tight lower bound for t-party set-disjointness
  – Under what circumstances are one-way and simultaneous communication equivalent?

32

Thank You!

33

Yao’s Lemma [Yao 83]

Definition: (μ,δ)-distributional CC, Dμ,δ(f) – the complexity of the best deterministic protocol that computes f with error δ on inputs drawn according to μ

Yao's Lemma: Rδ(f) ≥ maxμ Dμ,δ(f)

• Convenient technique to prove randomized CC lower bounds

34

Communication Complexity Lower Bounds via Information Theory

(with T.S. Jayram, R. Kumar, and D. Sivakumar, Complexity 2002)

• A novel information theory paradigm for proving CC lower bounds

• Applications
  – Characterization results (w.r.t. product distributions):
    • 1-way CC ≈ simultaneous CC
    • 2-party 1-way CC ≈ t-party 1-way CC
    • VC dimension characterization of t-party 1-way CC
  – Optimal lower bounds for simultaneous CC:
    • t-party set-disjointness: Ω(n/t)
    • Generalized addressing function

35

Information Theory

[Diagram: a sender transmits a message m ~ M over a noisy channel; the receiver receives r ~ R]

• M – distribution of transmitted messages
• R – distribution of received messages
• Goal of receiver: reconstruct m from r
• εg – error probability of a reconstruction function g

For a Boolean M:
Fano's Inequality: for all g, H2(εg) ≥ H(M | R)
MLE Principle: εMLE ≤ H(M | R)

36

Information Theory View of Distributional CC

• (x,y) is distributed according to μ = (X,Y)
• "God" transmits f(x,y) to Alice & Bob over a "noisy channel" – the CC protocol
• Alice & Bob receive the transcript Π(x,y)

• Fano's inequality: for any δ-error protocol Π for f,
H2(δ) ≥ H(f(X,Y) | Π(X,Y))

[Diagram: "God" → f(x,y) → CC protocol → Π(x,y) → Alice & Bob]

37

Simultaneous CC vs. One-Way CC

Theorem

For every product distribution μ = μX × μY, and every Boolean f,

Dμ,2H2(δ)^sim(f) ≤ Dμ,δ^(A→B)(f) + Dμ,δ^(B→A)(f)

Proof

A(x) – the message of Alice on x in a δ-error A → B protocol for f
B(y) – the message of Bob on y in a δ-error B → A protocol for f

Construct a SIM protocol for f:
A → Referee: A(x);  B → Referee: B(y)
Referee outputs MLE(f(X,Y) | A(x), B(y))

38

Simultaneous CC vs. One-Way CC: Proof (cont.)

By the MLE Principle,
Pr(MLE(f(X,Y) | A(X), B(Y)) ≠ f(X,Y)) ≤ H(f(X,Y) | A(X), B(Y))

By Fano,
H(f(X,Y) | A(X), Y) ≤ H2(δ) and H(f(X,Y) | X, B(Y)) ≤ H2(δ)

Lemma: For independent X, Y,
H(f(X,Y) | A(X), B(Y)) ≤ H(f(X,Y) | A(X), Y) + H(f(X,Y) | X, B(Y))

Our protocol errs with probability at most 2H2(δ). □
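Putting the three steps together (our one-line summary of the chain already on the slide):

```latex
\Pr[\text{error}]
  \;\le\; H\bigl(f(X,Y)\mid A(X),B(Y)\bigr)
  \;\le\; H\bigl(f(X,Y)\mid A(X),Y\bigr) + H\bigl(f(X,Y)\mid X,B(Y)\bigr)
  \;\le\; 2H_2(\delta).
```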