Random Projection, Generative Lattices, and Redundancy to Combat Scalability Bounds in Distributed Computing
Lee A. Carraher
School of Electronic and Computing Systems
University of Cincinnati
March 3, 2014
Introduction
- From Cincinnati
- B.S. Computer Engineering (UC 2008)
- M.S. Computer Science (UC 2012)
- Ph.D. Computer Engineering (ongoing)
- Advisor: Prof. Fred Annexstein
Interests
Some research interests:
- Machine Learning (bioinformatics, filtering)
- Inverse Problems (min/max problems)
- Parallel Computing (CUDA, MPI)
- Distributed Computing (MapReduce, Spark)
- Big Data
Motivations
"What are the important problems of your field?" - Richard Hamming
Data Storage is Cheap
- Store everything, because it's cheap (NSA...)
Big Problems in Computing
Two problems:
1. We are storing more data than we can effectively process (n → ∞)
2. Stagnated clock speeds
   - materials problem
   - energy problem
   - fundamental cooling problem (Landauer's Principle)
Ways To Attack Computing Problems
Faster Processors
More Processors
Better Algorithms
Parallel Computing
Simple: add more processors!

Basic Issues
- Communication Bottlenecks
- Algorithmic Bottlenecks
Scaled Speedup Example
Figure : Total Speedup as a function of Parallel Speedup, comparing our theoretical speedup (93.4%) with the 90%, 95%, and 97% serial-code curves and the theoretical speedup limit
- The 80/20 rule (Pareto) isn't even on here!
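These curves are instances of Amdahl's law, S(p, n) = 1 / ((1 − p) + p/n). A minimal sketch (Python, purely illustrative; p is taken here as the parallelizable fraction) of how the curves and their asymptotes arise:

```python
# Amdahl's law: total speedup for a program whose fraction p is
# parallelizable, run on n processors. The serial part (1 - p) caps
# the total speedup at 1 / (1 - p) no matter how large n grows.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.90, 0.95, 0.97):
    print(f"p={p}: n=200 -> {amdahl_speedup(p, 200):.1f}x, "
          f"asymptote -> {1.0 / (1.0 - p):.1f}x")
```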
Ways To Attack Computing Problems
Faster Processors
More Processors
Better Algorithms
Has Moore’s Law Stalled?
- Clock speed is dead.
Ways To Attack Computing Problems
Faster Processors
More Processors
Better Algorithms
Better Algorithms?
- This attack plan is not well defined
- Some algorithms are optimal
Ways To Attack Computing Problems
Faster Processors
More Processors
Better Algorithms
- What about the overlaps?
RPHASH
Random Projection Hashing goals:
- Scalability
- Minimize communication complexity

Tradeoffs:
- Redundancy
- Accuracy
Related Work
Database clustering methods:
- DBSCAN
- CLIQUE
- CLARANS
- PROCLUS
Background
- Big Data
- COD
- Locality Sensitive Hash Functions
- Space Partitioning
- Lattices
  - A Decoding Example
  - Leech Lattice
  - Leech Decoder
- Functional Programming
- MR/Hadoop
- Random Projection
Working Definition of “Big Data”
Sales and commercial hype aside,

Definition (Big Data): a set of data processing problems in which the required data is too large to reside in main memory. Thrashing between main memory and disk (even solid state) ⇒ an unscalable algorithm.

- Health Metrics
- DNA Sequences
- Website/Click Metrics
COD
Curse of Dimensionality: COD is sometimes cited as the cause of the distance function losing its usefulness in high dimensions. This arises from the ratio of the volume of a hypersphere to that of the hypercube it is embedded in:

$\lim_{d\to\infty} \frac{\mathrm{Vol}(S_d)}{\mathrm{Vol}(C_d)} = \lim_{d\to\infty} \frac{\pi^{d/2}}{d\,2^{d-1}\,\Gamma(d/2)} = 0$

Given a single distribution, the minimum and maximum distances become indiscernible; equivalently, the overwhelming majority of the space lies outside of the sphere.
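A quick numeric check of this limit (Python; math.gamma is the Γ above):

```python
import math

# Ratio of the volume of a ball inscribed in a d-dimensional cube to the
# cube's volume: pi^(d/2) / (d * 2^(d-1) * Gamma(d/2)). It collapses fast.
def sphere_to_cube_ratio(d: int) -> float:
    return math.pi ** (d / 2) / (d * 2 ** (d - 1) * math.gamma(d / 2))

for d in (2, 8, 24, 100):
    print(d, sphere_to_cube_ratio(d))
# d=2 gives pi/4 ~ 0.785; by d=24 the ratio is already ~1e-10.
```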
LSH Hash Families
Definition (Locality Sensitive Hash Function): let H = {h : S → U}. H is (r_1, r_2, p_1, p_2)-sensitive if for any u, v ∈ S:

1. if d(u, v) ≤ r_1 then Pr_H[h(u) = h(v)] ≥ p_1
2. if d(u, v) > r_2 then Pr_H[h(u) = h(v)] ≤ p_2

For this family, ρ = log p_1 / log p_2.
An Example Hash Family
Figure : Random Projection of R2 → R1
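A minimal sketch of the projection-and-quantize hash the figure depicts (Python with NumPy; the direction a, offset b, and bucket width w follow the standard p-stable construction, with illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hash(dim: int, w: float):
    a = rng.normal(size=dim)        # random projection direction (Gaussian)
    b = rng.uniform(0.0, w)         # random offset within one bucket
    return lambda v: int(np.floor((a @ v + b) / w))

h = make_hash(dim=2, w=1.0)
u, v = np.array([0.10, 0.20]), np.array([0.15, 0.25])
print(h(u), h(v))                   # nearby points usually share a bucket
```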
Voronoi Tiling
Voronoi partitioning is optimal in 2D.
Figure : Voronoi Partitioning of R2
Voronoi
- Voronoi diagrams make for very efficient hash functions in 2D because, by definition, a point within a Voronoi region is nearest to the region's representative point.
- Voronoi regions provide an optimal solution to NN partitioning in 2-D space!
- However, for arbitrary dimension d, Voronoi diagrams require Θ(n^{d/2}) space, and no known optimal point location algorithm exists.
Lattices
Instead we will consider lattices, which provide regular space partitioning, scale to arbitrarily large dimensional spaces, and have sub-linear nearest center search algorithms associated with them.

Definition (Lattice in R^n): let v_1, ..., v_n be n linearly independent vectors, where v_i = (v_{i,1}, v_{i,2}, ..., v_{i,n}). The lattice Λ with basis {v_1, ..., v_n} is the set of all integer combinations of v_1, ..., v_n; these integer combinations are the points of the lattice.

Λ = {z_1 v_1 + z_2 v_2 + ... + z_n v_n | z_i ∈ Z, 1 ≤ i ≤ n}
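For example, a few points of the hexagonal lattice can be enumerated directly from this definition (Python; the basis and the truncation range are illustrative):

```python
import numpy as np
from itertools import product

# Rows are the basis vectors v1, v2 of the hexagonal lattice in R^2.
B = np.array([[1.0, 0.0],
              [0.5, np.sqrt(3) / 2]])

# Lattice points are all integer combinations z1*v1 + z2*v2.
points = [np.array(z) @ B for z in product(range(-2, 3), repeat=2)]
print(len(points))   # 25 points from |z_i| <= 2
```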
Examples in 2D
Figure : Square (left) and Hexagonal (right) Lattices in R^2
Constant Time Decoding
- Certain lattices allow us to find the nearest representative point in constant time.
- For example, the above square lattice.
- The nearest point can be found by simply rounding our real-valued point to its nearest integer.
- With the exception of a few exceptional lattices, more complex lattices have more complex searches (exponential as d increases).
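The square (integer) lattice decoder is literally one line; a sketch:

```python
import numpy as np

# Nearest point of Z^n: round each coordinate independently, O(n) total.
def decode_zn(x: np.ndarray) -> np.ndarray:
    return np.rint(x)

print(decode_zn(np.array([0.1, 1.7, -0.6])))   # -> [ 0.  2. -1.]
```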
Exceptional Higher Dimensional Lattices
The previous lattices work well in R^2, but our data spaces are in general of dimension ≫ 2.

- Fortunately, there are some higher dimensional lattices with efficient nearest center search algorithms.
- E8, or Gosset's lattice, is one such lattice in R^8;
- it is also the densest lattice packing in R^8.
Example of decoding E8
E8 can be formed by gluing two D8 integer lattices together and shifting one by a vector of ½. This gluing of less dense lattices and shifting by a "glue vector" is a common theme in finding dense lattices.

- Decoding D8 is simple
- E8 = D8 ∪ (D8 + ½)
- both cosets of D8 can be computed in parallel
- D8's decoding algorithm consists of rounding all values to their nearest integer value s.t. they sum to an even number
Example of decoding E8
Define f(x) to round the components of x to their nearest integers, and g(x) to do the same except that the component furthest from an integer is rounded in the wrong direction. Let

x = <0.1, 0.1, 0.8, 1.3, 2.2, −0.6, −0.7, 0.9>

then

f(x) = <0, 0, 1, 1, 2, −1, −1, 1>, sum = 3

and

g(x) = <0, 0, 1, 1, 2, 0, −1, 1>, sum = 4

- Since the sum of g(x) is even, g(x) is the nearest lattice point in D8.
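A sketch of this D8 decoder in Python (decode_d8 is a hypothetical name; np.rint breaks exact .5 ties by rounding to even, which is close enough for illustration):

```python
import numpy as np

def decode_d8(x: np.ndarray) -> np.ndarray:
    f = np.rint(x)                       # round every component
    if int(f.sum()) % 2 == 0:            # even sum: f is already in D8
        return f
    g = f.copy()
    i = np.argmax(np.abs(x - f))         # component furthest from an integer
    g[i] += np.sign(x[i] - f[i])         # round it the "wrong" way instead
    return g                             # flipping one component fixes parity

x = np.array([0.1, 0.1, 0.8, 1.3, 2.2, -0.6, -0.7, 0.9])
print(decode_d8(x))                      # -> [ 0.  0.  1.  1.  2.  0. -1.  1.]
```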
Example of decoding E8, continued
- Include the coset D8 + ½.
- We can do this by subtracting ½ from all the values of x.

f(x − ½) = <0, 0, 0, 1, 2, −1, −1, 0>, sum = 1

g(x − ½) = <−1, 0, 0, 1, 2, −1, −1, 0>, sum = 0

Now we find the coset representative that is closest to x using a simple distance metric:

‖x − g(x)‖² = 0.65

‖x − (g(x − ½) + ½)‖² = 0.95

So in this case it is the first coset representative:

<0, 0, 1, 1, 2, 0, −1, 1>
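Combining the two cosets gives the E8 decoder; a sketch reusing decode_d8 from above:

```python
import numpy as np

# E8 = D8 U (D8 + 1/2): decode x against both cosets, keep the closer point.
def decode_e8(x: np.ndarray) -> np.ndarray:
    c0 = decode_d8(x)                    # the D8 coset
    c1 = decode_d8(x - 0.5) + 0.5        # the shifted D8 + 1/2 coset
    if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2):
        return c0
    return c1

x = np.array([0.1, 0.1, 0.8, 1.3, 2.2, -0.6, -0.7, 0.9])
print(decode_e8(x))   # -> [ 0.  0.  1.  1.  2.  0. -1.  1.] (distance 0.65)
```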
Leech Lattice
By gluing sets of E8 together in a way originally conceived by Curtis' MOG, we can get an even higher dimensional dense lattice called the Leech lattice.
Leech Lattice
Here we will state some attributes of the Leech lattice, as well as give a comparison to other lattices by way of E_b/N_0 and the computational cost of decoding.

Some important attributes:
- Densest regular lattice packing in R^24
- Lattice construction can be based on 2 cosets of G24
- Sphere packing density: π^12 / 12! ≈ 0.00192957
- Kissing number K_min = 196560
Leech Lattice
Figure : Performance of Some Coded and Unencoded Data Transmission Schemes: probability of bit error (log10 P_b) vs. E_b/N_0 (dB) over 3,072,000 samples, for the Leech lattice (hex, −0.2 dB) with QAM16, E8 with 2PSK, unencoded QAM64, and unencoded QAM16
Leech Lattice Decoding
Some information about the decoding algorithm:
- The decoding of the Leech lattice is based closely on the decoding of the Golay code.
- In general, advances in either Leech decoding or binary Golay decoding imply an advance in the other.
- The decoding method used in this implementation is based on Amrani and Be'ery's '96 publication for decoding the Leech lattice; it consists of around 519 floating point operations and suffers a gain loss of only 0.2 dB.
- In general, decoding complexity scales exponentially with dimension.

Next is an outline of the decoding process.
Leech Decoder
Figure : Leech Lattice Decoder (flowchart: 24 reals received → QAM decoders for the A and B points → block confidences for even and odd representatives → minimization over H6 → H-parity and K-parity checks, even and odd → compare candidate codewords)
Functional Programming
Figure : Scargill, by Tim (timble.me.uk/blog/author/tim)
Parallel and Functional Programming
- Instead of moving the mountain to the people, move the people to the mountain
- where mountains are data and people are functions, respectively
Map Reduce Program Design
Figure : Map Reduce (courtesy: map-reduce.wikispaces.asu.edu)
Hadoop
Hadoop is an open source implementation of the map reduce framework, created and maintained by the Apache Software Foundation.

Benefits:
- Open Source
- Popular, and Maintained
- Free
- Implemented on and compatible with Amazon EC2
- Takes care of the networking and fault tolerance drudgery of parallel system programming.
Hadoop Parallel System Design
Figure : Hadoop System Design (ibm.com/developerworks)
EC2 Cloud
Cloud and Distributed Services
- Scalable to data processing problems' needs
- Very low cost processing model
- Always up-to-date HW resources
- Zero HW maintenance and overhead costs
Mahout
Mahout is an open source library of machine learning algorithms made for Hadoop.

Clustering algorithms:
- Canopy Clustering
- K-Means
- Mean Shift
- LDA
- MinHash
Random Projection
Figure : Random Projection: R3 → R2
Random Projection
Theorem (JL Lemma)

(1 − ε)‖u − u′‖² ≤ ‖f(u) − f(u′)‖² ≤ (1 + ε)‖u − u′‖²

where
- ε is a distortion quantity
- u, u′ ∈ U are two independent vectors
- f : R^d → R^l is a random projection mapping
- l ∝ Θ(log(n) / (ε² log(1/ε)))
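A quick empirical check of the lemma (Python with NumPy; the constant in l and the scaled Gaussian map are illustrative choices, not the only JL construction):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 500, 5000, 0.25
l = int(np.ceil(8 * np.log(n) / eps ** 2))   # target dimension, l << d

X = rng.normal(size=(n, d))                  # n points in R^d
P = rng.normal(size=(d, l)) / np.sqrt(l)     # f(u) = u @ P
Y = X @ P

u, v, fu, fv = X[0], X[1], Y[0], Y[1]
ratio = np.sum((fu - fv) ** 2) / np.sum((u - v) ** 2)
print(l, ratio)   # ratio typically lands inside [1 - eps, 1 + eps]
```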
FJLT
Fast Johnson-Lindenstrauss Transform
- New and cool!
- By Heisenberg uncertainty in harmonic analysis, a spectrum and its signal cannot both be concentrated
- Precondition the projection with a DFT (some matrices have very fast DFTs)
- ≈ Θ(d log(d) + ε⁻³ log²(n)) vs Θ(d ε⁻² log(n))
Motivations of RP Hash
- Parallel structure of a scalable parallel algorithm (Log Reduce)
- Low communication overhead (hashes)
- Non-parallel iterative structure (per core redundancy)
- Approximation is usually good enough
General Idea of RPHash
Figure : Multi-Probe Random Projection Intersection Probabilities
- generative space quantization
- random projection
- sequential multi-probe stochastic process
Occultation Problem
The occultation problem is the probability of two or more independent distributions overlapping in projected space.

- based on the distribution variance and the angle of the projective plane
- applicable bounds from Urruty '07 (d is the number of probes):

$\lim_{d\to\infty} 1 - \left(\frac{2(r_1 + r_2)}{\pi\,\|d - c\|}\right)^{d}$

- In RPHash, d is the dimensionality (24).
- RPHash projections are orthogonal.
Sequential algorithm
1. Generate random projection matrix P
2. Maintain DB_count of hash IDs
3. Maintain DB_cent array of centroids corresponding to vectors
4. For all x ∈ X:
   4.1 ID = LatticeDec(xPᵀ)
   4.2 DB_count[ID]++
   4.3 DB_cent[ID] += x
5. sort[DB_count, DB_cent]
6. return DB_cent[0 : k]
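A compact sketch of steps 1-6 (Python; decode_e8 from the earlier slides stands in for the 24-dimensional Leech decoder, and the scaling of the projection relative to the lattice is glossed over):

```python
import numpy as np
from collections import defaultdict

def rphash(X: np.ndarray, k: int, d: int = 8, seed: int = 2):
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    P = rng.normal(size=(m, d)) / np.sqrt(m)      # random projection matrix
    counts = defaultdict(int)                     # DB_count
    cents = defaultdict(lambda: np.zeros(m))      # DB_cent
    for x in X:
        bucket = tuple(decode_e8(x @ P))          # ID = LatticeDec(x P^T)
        counts[bucket] += 1
        cents[bucket] += x
    top = sorted(counts, key=counts.get, reverse=True)[:k]
    return [cents[b] / counts[b] for b in top]    # densest buckets' centroids
```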
Comparison with standard k-means
Sequential Algorithm Time Results
Figure : Elapsed time (s) on Gaussian data. Left: time vs. number of vectors (5,000-20,000); fits: RP Hash y = −0.101 + 0.0002x, K-Means y = −0.050 + 8.3470E−5x. Right: time vs. number of dimensions (4,000-13,000); fits: RP Hash y = 0.3730 + 0.0006x, K-Means y = 0.0963 + 0.0001x
Sequential Algorithm Accuracy Results
Figure : Precision-recall performance on 50:50 Gaussian data. Left: PR vs. number of clusters (0-9); fits: RP Hash y = 0.8803 − 0.0155x, K-Means y = 0.9963 − 0.0189x. Right: PR vs. dimensions (0-16,000); fits: RP Hash y = 0.7942 + 3.885E−7x, K-Means y = 0.8681 − 1.040E−6x
Scalability Goal
- It would be nice if our algorithm scaled with the number of processing nodes
- Let's look at an algorithm that scales well and try to apply it to our problem
- The canonical Hadoop "Hello World" simulacrum, "Word Count", is a good place to start
Hadoop Word Count
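The canonical version is Java MapReduce; an equivalent Hadoop Streaming sketch in Python (mapper.py and reducer.py are hypothetical file names) looks like this:

```python
# mapper.py -- emit "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum the counts for each word (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```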
Hadoop Word Count Scalability
Basic Outline
- Each processing node computes hashes and counts
- Share the top k buckets with all computers
- Aggregate all centroid averages

A trick to minimize communication and storage requirements is to run in two phases:
- Phase 1: only store counts and communicate IDs
- Phase 2: only accept hash collisions with Phase 1's top IDs for all clusters
Parallel Algorithm Phase 1
begin
    X = {x_1, ..., x_n}, x_k ∈ R^m - data vectors
    D - set of available compute nodes
    H - a d-dimensional LSH function
    X̃ ⊆ X - vectors per compute node
    P^{m→d} - Gaussian projection matrix
    Cs = {∅} - set of bucket collision counts
    foreach x_k ∈ X̃ do
        x̃_k ← √(m/d) Pᵀx_k
        t = H(x̃_k)
        Cs[t] = Cs[t] + 1
    end
    sort({Cs, Cs.index})
    return {Cs, Cs.index}[0 : k log(n)]
end
Parallel Algorithm Phase 2
begin
    X = {x_1, ..., x_n}, x_k ∈ R^m - data vectors
    D - set of available compute nodes
    {Cs, Cs.index} - set of k log(n) cluster IDs and counts
    H - a d-dimensional LSH function
    X̃ ⊆ X - vectors per compute node
    p^{m→d} ∈ P - Gaussian projection matrices
    C = {∅} - set of centroids
    foreach x_k ∈ X̃ do
        foreach p^{m→d} ∈ P do
            x̃_k ← √(m/d) pᵀx_k
            t = H(x̃_k)
            if t ∈ Cs.index then
                C[t] = C[t] + x_k
            end
        end
    end
    return C
end
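An illustrative per-node sketch of the two phases (Python; H stands in for the d-dimensional LSH, i.e. the lattice decoder, and the function names are hypothetical):

```python
import numpy as np
from collections import Counter, defaultdict

def phase1(X_tilde, P, H, k):
    m, d = P.shape
    scale = np.sqrt(m / d)
    counts = Counter(H(scale * (x @ P)) for x in X_tilde)
    top = max(int(k * np.log(len(X_tilde))), k)      # keep k*log(n) buckets
    return [t for t, _ in counts.most_common(top)]   # IDs only: cheap to ship

def phase2(X_tilde, P, H, top_ids):
    m, d = P.shape
    scale, ids = np.sqrt(m / d), set(top_ids)
    C = defaultdict(lambda: np.zeros(P.shape[0]))    # centroid accumulators
    for x in X_tilde:
        t = H(scale * (x @ P))
        if t in ids:                                 # only Phase 1's top IDs
            C[t] += x
    return C
```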
What to Use this For?
Figure : Gene Expression Levels in Primary Breast Cancer Tumor Samples (heatmap clustering absolute log2 expression levels, with GSM sample columns and probe-id : gene-symbol rows such as TOP2A, RRM2, CCNB1, BIRC5, and CXCL9-11; color key from −3 to 3)
Data Security in BioInformatics
- New attacks on anonymized data present a risk to patients' privacy.
- Very few cloud services guarantee secure processing.
- Highly distributed systems add even more attack vectors.
- Attacks prompted a Presidential commission on WGS privacy.
Data Security: A Freebie!
- Random Projection offers some protections
- The only full vectors transmitted are cluster centroids, which by definition are an aggregate of many vectors.
- Showing dissimilarity should be somewhat straightforward:

v = √(n/k) Rᵀu,   v′ = √(k/n) vᵀR̂⁻¹

similarity = ‖v, v′‖₂
Questions?