So Much Data
![Page 1: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/1.jpg)
So Much Data
Bernard Chazelle, Princeton University
So Little Time
![Page 2: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/2.jpg)
So Many Slides
Bernard Chazelle, Princeton University
So Little Time
(before lunch)
![Page 3: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/3.jpg)
computation
math experimentation
algorithms
![Page 4: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/4.jpg)
Computers have two problems
![Page 5: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/5.jpg)
1. They don’t have steering wheels
![Page 6: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/6.jpg)
![Page 7: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/7.jpg)
2. End of Moore’s Law
party’s over !
![Page 8: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/8.jpg)
computation
algorithms experimentation
![Page 9: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/9.jpg)
32 × 17
= 224 + 320
= 544
This is not me
![Page 10: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/10.jpg)
FFT
RSA
![Page 11: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/11.jpg)
![Page 12: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/12.jpg)
![Page 13: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/13.jpg)
noisy
low entropy
uncertain
unevenly priced
big
![Page 14: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/14.jpg)
noisy
low entropy
uncertain
unevenly priced
big
![Page 15: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/15.jpg)
Biomedical imaging
Sloan Digital Sky Survey
4 petabytes (~1MG)
10 petabytes/yr
150 petabytes/yr
![Page 16: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/16.jpg)
Collected works of Micha Sharir
My A(9,9)-th paper
![Page 17: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/17.jpg)
massive input output
Sublinear Algorithms
Sample tiny fraction
![Page 18: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/18.jpg)
Shortest Paths [C-Liu-Magen ’03]
New York
Delphi
![Page 19: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/19.jpg)
Ray Shooting
Volume Intersection Point location
![Page 20: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/20.jpg)
Approximate MST [C-Rubinfeld-Trevisan ’01]
![Page 21: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/21.jpg)
Reduces to counting connected components
![Page 22: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/22.jpg)
E[X] = no. connected components
var[X] << (no. connected components)²
whp, the estimator X is a good estimate of # connected components
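A minimal sketch of how such an estimator can work, in the spirit of [C-Rubinfeld-Trevisan ’01] but not their exact procedure. It relies on the fact that if every vertex v contributes 1/|C(v)|, where C(v) is its connected component, the contributions sum to the number of components. Sampling a few vertices and exploring only small components gives a sublinear estimate. The names `component_size`, `estimate_components`, and the `cap` parameter are assumptions made for this illustration.

```python
import random
from collections import deque


def component_size(adj, start, cap):
    """BFS from `start`; give up (return None) once more than `cap` vertices are seen."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if len(seen) > cap:
                    return None  # component too large: treated as contributing 0
                queue.append(v)
    return len(seen)


def estimate_components(adj, samples, cap, rng):
    """Estimate the number of connected components from `samples` random vertices."""
    n = len(adj)
    total = 0.0
    for _ in range(samples):
        u = rng.randrange(n)
        size = component_size(adj, u, cap)
        if size is not None:
            total += 1.0 / size
    return n * total / samples


# 100 vertices forming 50 disjoint edges, so exactly 50 components.
adj = [[i ^ 1] for i in range(100)]
estimate = estimate_components(adj, samples=200, cap=10, rng=random.Random(0))
```

Ignoring components larger than `cap` undercounts by at most n/cap, which is the trade-off that keeps each probe cheap.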
![Page 23: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/23.jpg)
worst case
input space
average case (uniform)
![Page 24: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/24.jpg)
worst case
![Page 25: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/25.jpg)
average case = actuarial view
![Page 26: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/26.jpg)
“OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados.”
![Page 27: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/27.jpg)
arbitrary, unknown random source
Self-Improving Algorithms
![Page 28: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/28.jpg)
Yes ! This could be YOU, too !
![Page 29: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/29.jpg)
E[T_k] → Optimal expected time for random source
time T1 time T2 time T3 time T4
![Page 30: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/30.jpg)
Clustering [Ailon-C-Liu-Comandur ’05]
K-median over Hamming cube
![Page 31: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/31.jpg)
minimize sum of distances
![Page 32: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/32.jpg)
minimize sum of distances
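As a small illustration of the objective, the sketch below computes the k-median cost over the Hamming cube: each point pays its Hamming distance to the nearest center, and the costs are summed. For a single cluster, taking the coordinate-wise majority bit minimizes that sum, since Hamming distance decomposes per coordinate. The function names are assumptions made for this example, and this is only the cost function, not a clustering algorithm.

```python
def hamming(a, b):
    """Number of coordinates where bit-vectors a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def kmedian_cost(points, centers):
    """Sum over points of the Hamming distance to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)


def majority_center(points):
    """Coordinate-wise majority bit: an optimal 1-median in the Hamming cube."""
    n = len(points)
    d = len(points[0])
    return tuple(1 if 2 * sum(p[i] for p in points) > n else 0 for i in range(d))


points = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
cost = kmedian_cost(points, [(0, 0, 0), (1, 1, 1)])

cluster = [(0, 0, 1), (0, 1, 1), (0, 0, 0)]
center = majority_center(cluster)
```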
![Page 33: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/33.jpg)
[Kumar-Sabharwal-Sen ’04]
COST ≤ OPT (1 + ε)
![Page 34: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/34.jpg)
How to achieve linear limiting time?
Input space {0,1}^dn
prob < O(dn)/KSS
Identify core
Tail:
Use KSS
![Page 35: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/35.jpg)
Store sample of precomputed KSS
Nearest neighbor
Incremental algorithm
![Page 36: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/36.jpg)
Main difficulty: How to spot the tail?
![Page 37: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/37.jpg)
![Page 38: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/38.jpg)
encode
![Page 39: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/39.jpg)
decode
![Page 40: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/40.jpg)
![Page 41: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/41.jpg)
Data inaccessible before noise
What makes you think it’s wrong?
![Page 42: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/42.jpg)
Data inaccessible before noise
must satisfy some property (e.g., convex, bipartite) but does not quite
![Page 43: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/43.jpg)
f(x) = ?
x
f(x)
data
f = access function
![Page 44: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/44.jpg)
f(x) = ?
x
f(x)
f = access function
![Page 45: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/45.jpg)
f(x) = ?
x
f(x)
But life being what it is…
![Page 46: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/46.jpg)
f(x) = ?
x
f(x)
![Page 47: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/47.jpg)
)(O
Humans
Define distance from any object to data class
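One concrete instance of such a distance, offered here as an assumption rather than the slide's own definition: for a 1-D sequence and the class of monotone (non-decreasing) sequences, count the minimum number of entries that must be changed. That number equals the length of the sequence minus the length of its longest non-decreasing subsequence, which the sketch below computes. The name `distance_to_monotone` is hypothetical.

```python
from bisect import bisect_right


def distance_to_monotone(values):
    """Minimum number of entries to change so that `values` becomes non-decreasing."""
    # tails[i] holds the smallest possible last element of a
    # non-decreasing subsequence of length i + 1 seen so far.
    tails = []
    for v in values:
        i = bisect_right(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(values) - len(tails)


noisy = [1, 2, 9, 3, 4]
changes_needed = distance_to_monotone(noisy)
```

Here only the entry 9 has to change, so the sequence is at distance 1 from the monotone class.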
![Page 48: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/48.jpg)
f(x) = ?
x
g(x)
x1, x2,…
f(x1), f(x2),…
filter
g is access function for:
![Page 49: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/49.jpg)
Online Data Reconstruction
![Page 50: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/50.jpg)
Monotone function: [n]^d → R
Filter requires polylog (n) lookups
[Ailon-C-Liu-Comandur ’04]
![Page 51: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/51.jpg)
Convex polygon
Filter requires : lookups
[C-Comandur ’06 ]
![Page 52: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/52.jpg)
Convex terrain
Filter requires : lookups
![Page 53: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/53.jpg)
Iterated planar separator theorem
![Page 54: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/54.jpg)
Iterated planar separator theorem
![Page 55: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/55.jpg)
Iterated (weak) planar separator theorem in sublinear time!
![Page 56: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/56.jpg)
Using epsilon-nets in spaces of unbounded VC dimension
reconstruct
![Page 57: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/57.jpg)
bipartite graph
k-connectivity expander
![Page 58: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/58.jpg)
denoising low-dim attractor sets
![Page 59: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/59.jpg)
Priced computation & accuracy
spectrometry/cloning/gene chip
PCR/hybridization/chromatography
gel electrophoresis/blotting
001100001010001111110011001101011100001100000101111o1o1100001100
Linear programming
![Page 60: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/60.jpg)
Pricing data
Factoring is easy. Here’s why…
Gaussian mixture sample: 00100101001001101010101….
![Page 61: So Much Data](https://reader036.fdocuments.us/reader036/viewer/2022062400/56813aa2550346895da29b9e/html5/thumbnails/61.jpg)
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu, Avner Magen, Ronitt Rubinfeld, Luca Trevisan