
Random Projection, Generative Lattices, and Redundancy to Combat Scalability Bounds in Distributed Computing

Lee A. Carraher

School of Electronic and Computing Systems, University of Cincinnati

March 3, 2014

Introduction

- From Cincinnati
- B.S. Computer Engineering (UC 2008)
- M.S. Computer Science (UC 2012)
  - Advisor: Prof. Fred Annexstein
- Ph.D. Computer Engineering (ongoing)
  - Advisor: Prof. Fred Annexstein

Interests

Some research interests:
- Machine Learning (bioinformatics, filtering)
- Inverse Problems (min/max problems)
- Parallel Computing (CUDA, MPI)
- Distributed Computing (MapReduce, Spark)
- Big Data

Motivations

"What are the important problems of your field?" - Richard Hamming

Data Storage is Cheap

- Store everything, because it's cheap (NSA...)

Big Problems in Computing

Two problems:
1. We are storing more data than we can effectively process (n → ∞)
2. Stagnated clock speeds
   - materials problem
   - energy problem
   - fundamental cooling problem (Landauer's Principle)

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms


Parallel Computing

Simple: add more processors!

Basic issues:
- Communication bottlenecks
- Algorithmic bottlenecks

Scaled Speedup Example

[Plot: Total Speedup as a function of Parallel Speedup (parallel speedup 0-200, total speedup 5-30); legend: Parallelism, Our Theoretical Speedup (93.4%), 90% Serial Code, 95% Serial Code, 97% Serial Code, Theoretical Speedup.]

- The 80/20 rule (Pareto) isn't even on here!
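To make those curves concrete, here is a minimal Amdahl's-law sketch (my own; it assumes the legend percentages denote the parallelizable fraction p):

```python
# A minimal Amdahl's-law sketch: total speedup when a fraction p of the
# work is parallelizable and that portion is sped up by a factor s.
def amdahl(p: float, s: float) -> float:
    return 1.0 / ((1.0 - p) + p / s)

for p in (0.934, 0.90, 0.95, 0.97):
    # Even a 200x parallel speedup is capped by the serial fraction.
    print(f"p = {p:.3f}: speedup at s=200 is {amdahl(p, 200):5.2f}x, "
          f"limit is {1.0 / (1.0 - p):5.2f}x")
```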

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms


Has Moore’s Law Stalled?

- Clock speed is dead.

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms


Better Algorithms?

- This attack plan is not well defined
- Some algorithms are optimal

Ways To Attack Computing Problems

Faster Processors

More Processors

Better Algorithms

- What about the overlaps?

RPHASH

Random Projection Hashing goals:
- Scalability
- Minimize communication complexity

Tradeoffs:
- Redundancy
- Accuracy

Related Work

Database clustering methods:
- DBSCAN
- CLIQUE
- CLARANS
- PROCLUS

Background

- Big Data
- Curse of Dimensionality (COD)
- Locality Sensitive Hash Functions
- Space Partitioning
- Lattices
  - A Decoding Example
  - Leech Lattice
  - Leech Decoder
- Functional Programming
- MR/Hadoop
- Random Projection

Working Definition of “Big Data”

Sales and commercial hype aside,

Definition (Big Data)
A set of data processing problems in which the required data is too large to reside in main memory. Thrashing between main memory and disk (even solid state) makes an otherwise-scalable algorithm unscalable.

Examples:
- Health metrics
- DNA sequences
- Website/click metrics

COD

Curse of Dimensionality
The COD is sometimes cited as the cause of the distance function losing its usefulness in high dimensions. It arises from the ratio of the volume of a hypersphere to the volume of the metric-space partition (a hypercube) it is embedded in:

lim_{d→∞} Vol(S_d) / Vol(C_d) = π^(d/2) / (d · 2^(d−1) · Γ(d/2)) → 0

Given a single distribution, the minimum and maximum distances become indiscernible; equivalently, the overwhelming majority of the space lies outside of the sphere.
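A small numeric check of that limit (my own sketch, not from the slides), computed in log space to avoid overflow:

```python
# Volume fraction of the cube [-1, 1]^d occupied by the inscribed unit ball:
# Vol(S_d)/Vol(C_d) = pi^(d/2) / (d * 2^(d-1) * Gamma(d/2))
import math

def sphere_cube_ratio(d: int) -> float:
    return math.exp((d / 2) * math.log(math.pi)
                    - math.log(d)
                    - (d - 1) * math.log(2.0)
                    - math.lgamma(d / 2))

for d in (2, 8, 24, 100):
    print(d, sphere_cube_ratio(d))
# d=2 gives pi/4 ~ 0.785; by d=24 the ratio is already ~1e-10.
```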

LSH Hash Families

Definition (Locality Sensitive Hash Family)
A family H = {h : S → U} is (r_1, r_2, p_1, p_2)-sensitive if for any u, v ∈ S:

1. if d(u, v) ≤ r_1 then Pr_H[h(u) = h(v)] ≥ p_1
2. if d(u, v) > r_2 then Pr_H[h(u) = h(v)] ≤ p_2

For this family, ρ = log p_1 / log p_2.

An Example Hash Family

Figure: Random projection of R^2 → R^1; the projected line is cut into contiguous buckets of width w along a random direction r.
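The pictured family is the classic projection-and-quantize scheme; a minimal sketch (my own code, illustrative parameters) looks like this:

```python
# Project onto a random direction r, add a random offset b, and cut the
# projected line into buckets of width w.
import numpy as np

rng = np.random.default_rng(0)

def make_hash(dim: int, w: float):
    r = rng.normal(size=dim)      # random projection direction
    b = rng.uniform(0.0, w)       # random shift within one bucket
    def h(v: np.ndarray) -> int:
        return int(np.floor((v @ r + b) / w))
    return h

h = make_hash(dim=2, w=1.0)
u, v = np.array([0.10, 0.20]), np.array([0.15, 0.25])
print(h(u), h(v))                 # nearby points usually share a bucket
```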

Voronoi Tiling

Voronoi partitioning is optimal in 2D.

Figure: Voronoi partitioning of R^2


Voronoi

- Voronoi diagrams make for very efficient hash functions in 2D because, by definition, a point within a Voronoi region is nearest to the region's representative point.
- Voronoi regions provide an optimal solution to nearest-neighbor partitioning in 2D space!
- However, for arbitrary dimension d, Voronoi diagrams require Θ(n^⌈d/2⌉) space, and no optimal point-location algorithm is known.

Lattices

Instead we will consider lattices, which provide regular space partitioning, scale to arbitrarily high-dimensional spaces, and have sub-linear nearest-center search algorithms associated with them.

Definition (Lattice in R^n)
Let v_1, ..., v_n be n linearly independent vectors, where v_i = (v_{i,1}, v_{i,2}, ..., v_{i,n}). The lattice Λ with basis {v_1, ..., v_n} is the set of all integer combinations of the basis vectors:

Λ = {z_1 v_1 + z_2 v_2 + ... + z_n v_n | z_i ∈ Z, 1 ≤ i ≤ n}

Examples in 2D

Figure: Square (left) and hexagonal (right) lattices in R^2

Constant Time Decoding

- Certain lattices allow us to find the nearest representative point in constant time.
- For example, the square lattice above: the nearest lattice point can be found by simply rounding each coordinate of a real-valued point to its nearest integer (a sketch of this rounding decoder follows).
- With the exception of a few exceptional lattices, more complex lattices have more complex searches (exponential as d increases).
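A minimal sketch of that rounding decoder (my own illustration):

```python
# Constant-time decoder for the square lattice Z^n: one rounding per
# coordinate.
import numpy as np

def decode_zn(x: np.ndarray) -> np.ndarray:
    """Nearest point of Z^n to x."""
    return np.rint(x)

print(decode_zn(np.array([0.1, 0.8, 1.3, -0.6])))  # [ 0.  1.  1. -1.]
```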

Exceptional Higher Dimensional Lattices

The previous lattices work well in R^2, but our data spaces are in general of dimension ≫ 2.

- Fortunately, there are some higher-dimensional lattices with efficient nearest-center search algorithms.
- E8, or Gosset's lattice, is one such lattice in R^8.
- It is also the densest lattice packing in R^8.

Example of decoding E8

E8 can be formed by gluing together two D8 integer lattices, one shifted by the all-halves vector (1/2, ..., 1/2). This gluing of less dense lattices, shifted by a "glue vector," is a common theme in constructing dense lattices.

- Decoding D8 is simple.
- E8 = D8 ∪ (D8 + 1/2)
- Both cosets of D8 can be decoded in parallel.
- D8's decoding algorithm consists of rounding all values to their nearest integers such that they sum to an even number.

Example of decoding E8

Define f(x) to round each component of x to its nearest integer, and g(x) to do the same except that the component furthest from an integer is rounded in the wrong direction. Let

x = ⟨0.1, 0.1, 0.8, 1.3, 2.2, −0.6, −0.7, 0.9⟩

then

f(x) = ⟨0, 0, 1, 1, 2, −1, −1, 1⟩, sum = 3

and

g(x) = ⟨0, 0, 1, 1, 2, 0, −1, 1⟩, sum = 4

- Since the sum of g(x) is even, g(x) is the nearest lattice point in D8.

Example of decoding E8, cont.

- Include the coset D8 + 1/2.
- We can do this by subtracting 1/2 from all the values of x and decoding in D8 again:

f(x − 1/2) = ⟨0, 0, 0, 1, 2, −1, −1, 0⟩, sum = 1

g(x − 1/2) = ⟨−1, 0, 0, 1, 2, −1, −1, 0⟩, sum = 0

Now we find the coset representative that is closest to x using a simple distance metric:

‖x − g(x)‖² = 0.65

‖x − (g(x − 1/2) + 1/2)‖² = 0.95

So in this case the nearest E8 point is the first coset's representative:

⟨0, 0, 1, 1, 2, 0, −1, 1⟩
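Putting the pieces together, here is a runnable sketch of the full E8 decoder (my own code following the slides' f/g construction, not the author's implementation):

```python
import numpy as np

def f(x):
    """Round every component to its nearest integer."""
    return np.rint(x)

def g(x):
    """Like f, but round the component furthest from an integer the
    wrong way, flipping the parity of the coordinate sum."""
    y = np.rint(x)
    i = np.argmax(np.abs(x - y))      # furthest-from-integer index
    y[i] += np.sign(x[i] - y[i])      # round it the wrong way
    return y

def decode_d8(x):
    """Nearest D8 point: the rounding (f or g) with an even sum."""
    y = f(x)
    return y if y.sum() % 2 == 0 else g(x)

def decode_e8(x):
    """E8 = D8 ∪ (D8 + 1/2): decode both cosets, keep the closer."""
    a = decode_d8(x)
    b = decode_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b

x = np.array([0.1, 0.1, 0.8, 1.3, 2.2, -0.6, -0.7, 0.9])
print(decode_e8(x))   # ⟨0, 0, 1, 1, 2, 0, -1, 1⟩, as in the example
```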

Leech Lattice

By gluing sets of E8 together in a way originally conceived via Curtis's MOG, we can get an even higher-dimensional dense lattice: the Leech lattice.

Leech Lattice

Here we state some attributes of the Leech lattice, as well as a comparison to other lattices by way of Eb/N0 and the computational cost of decoding.

Some important attributes:
- Densest regular lattice packing in R^24
- Lattice construction can be based on 2 cosets of G24 (the binary Golay code)
- Sphere packing density: π^12 / 12! ≈ 0.00192957
- Kissing number K_min = 196560

Leech Lattice

[Plot: probability of bit error, log10(Pb), versus Eb/N0 (dB), over 3,072,000 samples: Leech (hex, −0.2 dB) with QAM16, E8 with 2PSK, unencoded QAM64, and unencoded QAM16.]

Figure: Performance of some coded and unencoded data transmission schemes

Leech Lattice Decoding

Some information about the decoding algorithm:
- The decoding of the Leech lattice is based closely on the decoding of the Golay code.
- In general, advances in either Leech decoding or binary Golay decoding imply an advance in the other.
- The decoding method used in this implementation is based on Amrani and Be'ery's '96 publication for decoding the Leech lattice; it consists of around 519 floating point operations and suffers a gain loss of only 0.2 dB.
- In general, decoding complexity scales exponentially with dimension.

Next is an outline of the decoding process.

Leech Decoder

[Block diagram: 24 received reals feed two parallel QAM decoders (A points and B points); block-confidence values for the even and odd representatives are minimized over the hexacode H6, checked for H-parity and K-parity (even and odd branches), and the surviving candidate codewords are compared.]

Figure: Leech lattice decoder

Functional Programming

Figure: Scargill, by Tim (timble.me.uk/blog/author/tim)

Parallel and functional programming:
- Instead of moving the mountain to the people, move the people to the mountain.
- Here the mountains are the data and the people are the functions.

Map Reduce Program Design

Figure: Map Reduce (courtesy: map-reduce.wikispaces.asu.edu)

Hadoop

Hadoop is an open source implementation of the map reduce framework created and maintained by the Apache Software Foundation.

Benefits:
- Open source
- Popular and maintained
- Free
- Implemented on and compatible with Amazon EC2
- Takes care of the networking and fault tolerance drudgery of parallel system programming

Hadoop Parallel System Design

Figure: Hadoop system design (ibm.com/developerworks)

EC2 Cloud

Cloud and distributed services:
- Scales to the needs of the data processing problem
- Very low cost processing model
- Always up-to-date HW resources
- Zero HW maintenance and overhead costs

Mahout

Mahout is an open source library of machine learning algorithms made for Hadoop.

Clustering algorithms:
- Canopy Clustering
- K-Means
- Mean Shift
- LDA
- MinHash

Random Projection

Figure: Random projection: R^3 → R^2

Random Projection

Theorem (JL Lemma)

(1 − ε)‖u − u′‖² ≤ ‖f(u) − f(u′)‖² ≤ (1 + ε)‖u − u′‖²

where:
ε - a distortion quantity
u, u′ ∈ U - two independent vectors
f - a random projection mapping R^d → R^l

l ∝ Θ(log(n) / (ε² log(1/ε)))
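As a sketch (my own, with illustrative dimensions), a scaled Gaussian random projection realizing the mapping f:

```python
# f(u) = R u, with R an l x d Gaussian matrix scaled by 1/sqrt(l),
# approximately preserves squared distances per the lemma.
import numpy as np

rng = np.random.default_rng(1)
d, l = 10_000, 256
R = rng.normal(size=(l, d)) / np.sqrt(l)

u, v = rng.normal(size=d), rng.normal(size=d)
orig = np.sum((u - v) ** 2)
proj = np.sum((R @ u - R @ v) ** 2)
print(proj / orig)   # close to 1, per the lemma
```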

FJLT

Fast Johnson-Lindenstrauss Transform
- New and cool!
- By the Heisenberg uncertainty principle of harmonic analysis, a signal and its spectrum cannot both be concentrated.
- Precondition the projection with a DFT (some matrices have very fast DFTs).
- ≈ Θ(d log(d) + ε⁻³ log²(n)) vs. Θ(d ε⁻² log(n))
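The FJLT preconditions with random signs D and a fast Hadamard transform H before a sparse random projection P. The sketch below is my own simplification: it replaces P with plain coordinate subsampling (closer to the subsampled randomized Hadamard transform), which still shows the preconditioning idea.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))        # orthonormal: norm-preserving

rng = np.random.default_rng(2)
d, l = 1024, 64
signs = rng.choice([-1.0, 1.0], size=d)        # D: random sign flips
coords = rng.choice(d, size=l, replace=False)  # simplified sparse P

u = rng.normal(size=d)
proj = np.sqrt(d / l) * fwht(signs * u)[coords]
print(np.sum(proj ** 2) / np.sum(u ** 2))      # ~1: norm preserved
```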

Motivations of RP Hash

- Parallel structure of a scalable parallel algorithm (Log Reduce)
- Low communication overhead (hashes)
- Non-parallel iterative structure (per-core redundancy)
- Approximation is usually good enough

General Idea of RPHash

Figure: Multi-probe random projection intersection probabilities

- generative space quantization
- random projection
- sequential multi-probe stochastic process

Occultation Problem

The occultation problem concerns the probability that two or more independent distributions overlap in projected space.

- It is based on the distributions' variance and the angle of the projective plane.
- Applicable bounds come from Urruty '07, where d is the number of probes:

lim_{d→∞} 1 − (2(r_1 + r_2) / (π‖d − c‖))^d

- In RPHash, d is the dimensionality (24).
- RPHash projections are orthogonal.

Sequential algorithm

1. Generate random projection matrix P
2. Maintain DB_count, a table of hash ID counts
3. Maintain DB_cent, an array of centroids corresponding to vectors
4. For all x ∈ X:
   4.1 index = LatticeDec(x Pᵀ)
   4.2 DB_count[index]++
   4.3 DB_cent[index] += x
5. sort [DB_count, DB_cent]
6. return DB_cent[0 : k]
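A compact runnable sketch of this loop (my own illustration; integer rounding stands in for the 24-dimensional Leech decoder, and the scaling is illustrative):

```python
from collections import defaultdict
import numpy as np

def rphash_sequential(X, k, d=24):
    n, m = X.shape
    rng = np.random.default_rng(3)
    P = rng.normal(size=(m, d)) / np.sqrt(m)    # projection matrix P
    counts = defaultdict(int)                   # DB_count
    sums = defaultdict(lambda: np.zeros(m))     # DB_cent
    for x in X:
        xt = np.sqrt(m / d) * (P.T @ x)         # project to R^d
        bucket = tuple(np.rint(xt))             # stand-in LatticeDec
        counts[bucket] += 1
        sums[bucket] += x
    top = sorted(counts, key=counts.get, reverse=True)[:k]
    return [sums[b] / counts[b] for b in top]   # densest k centroids
```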

Comparison with standard k-means


Sequential Algorithm Time Results

[Plots: elapsed time (s) on Gaussian data. Versus number of vectors (5,000-20,000): RPHash mean y = −0.101 + 0.0002x; K-Means mean y = −0.050 + 8.347e−5x. Versus number of dimensions (4,000-13,000): RPHash mean y = 0.3730 + 0.0006x; K-Means mean y = 0.0963 + 0.0001x.]

Sequential Algorithm Accuracy Results

[Plots: precision-recall performance on 50:50 Gaussian data. Versus number of clusters (0-9): RPHash mean y = 0.8803 − 0.0155x; K-Means mean y = 0.9963 − 0.0189x. Versus dimensions (0-16,000): RPHash mean y = 0.7942 + 3.885e−7x; K-Means mean y = 0.8681 − 1.040e−6x.]

Scalability Goal

- It would be nice if our algorithm scaled with the number of processing nodes.
- Let's look at an algorithm that scales well and try to apply it to our problem.
- The canonical Hadoop "Hello World" simulacrum, "Word Count," is a good place to start.

Hadoop Word Count

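The word-count source shown on this slide did not survive transcription; a minimal Hadoop Streaming equivalent in Python (a sketch, not the slide's original code) would be:

```python
#!/usr/bin/env python
# mapper.py - emit "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py - sum counts per word (Hadoop sorts input by key first).
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    print(f"{word}\t{sum(int(c) for _, c in group)}")
```

These would be launched via the hadoop-streaming jar with -input/-output paths and -mapper mapper.py -reducer reducer.py (the jar's location varies by install).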

Hadoop Word Count Scalability


Basic Outline

- Each processing node computes hashes and counts.
- Share the top k buckets among all computers.
- Aggregate all centroid averages.

A trick to minimize communication and storage requirements: two phases.
- Phase 1: Only store counts and communicate IDs.
- Phase 2: Only accept hash collisions with Phase 1's top IDs for all clusters.

Parallel Algorithm Phase 1

begin
    X = {x_1, ..., x_n}, x_k ∈ R^m - data vectors
    D - set of available compute nodes
    H - a d-dimensional LSH function
    X̃ ⊆ X - vectors per compute node
    P(m→d) - Gaussian projection matrix
    Cs = {∅} - set of bucket collision counts
    foreach x_k ∈ X̃ do
        x̃_k ← √(m/d) Pᵀ x_k
        t = H(x̃_k)
        Cs[t] = Cs[t] + 1
    end
    sort({Cs, Cs.index})
    return {Cs, Cs.index}[0 : k log(n)]
end

Parallel Algorithm Phase 2

begin
    X = {x_1, ..., x_n}, x_k ∈ R^m - data vectors
    D - set of available compute nodes
    {Cs, Cs.index} - set of k log(n) cluster IDs and counts
    H - a d-dimensional LSH function
    X̃ ⊆ X - vectors per compute node
    p(m→d) ∈ P - Gaussian projection matrices
    C = {∅} - set of centroids
    foreach x_k ∈ X̃ do
        foreach p(m→d) ∈ P do
            x̃_k ← √(m/d) pᵀ x_k
            t = H(x̃_k)
            if t ∈ Cs.index then
                C[t] = C[t] + x_k
            end
        end
    end
    return C
end
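A toy single-node driver for the two phases (my own sketch; the slides' Phase 2 iterates over several projection matrices, simplified here to one, and in RPHash each foreach would run as a map task with results merged in a reduce):

```python
import heapq
import numpy as np

def phase1(Xt, P, H, k, n):
    m, d = P.shape
    Cs = {}
    for x in Xt:                                  # count bucket collisions
        t = H(np.sqrt(m / d) * (P.T @ x))
        Cs[t] = Cs.get(t, 0) + 1
    return set(heapq.nlargest(int(k * np.log(n)), Cs, key=Cs.get))

def phase2(Xt, P, H, top_ids):
    m, d = P.shape
    sums, cnts = {}, {}
    for x in Xt:                                  # accumulate top IDs only
        t = H(np.sqrt(m / d) * (P.T @ x))
        if t in top_ids:
            sums[t] = sums.get(t, 0) + x
            cnts[t] = cnts.get(t, 0) + 1
    return {t: sums[t] / cnts[t] for t in sums}

# Example H: quantize the projection (a lattice decoder in RPHash).
H = lambda v: tuple(np.rint(v))
```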

What to Use this For?

[Heatmap: hierarchical clustering of absolute log2 expression levels across GSM-numbered primary tumor samples, over probes for genes including S100A8, MMP1, CXCL9/10/11, TOP2A, RRM2, CCNB1/2, BIRC5, AURKA, FOS, and EGR1; color key from −3 to 3.]

Figure: Gene expression levels in primary breast cancer tumor samples

Data Security in Bioinformatics

- New attacks on anonymized data present a risk to patients' privacy.
- Very few cloud services guarantee secure processing.
- Highly distributed systems add even more attack vectors.
- Attacks prompted a Presidential commission on WGS privacy.

Data Security: A Freebie!

- Random projection offers some protections.
- The only full vectors transmitted are cluster centroids, which by definition are an aggregate of many vectors.
- Showing dissimilarity should be somewhat straightforward:
  - v = √(n/k) Rᵀu,  v′ = √(k/n) vᵀ R̂⁻¹
  - similarity = ‖v − v′‖₂

Questions?