Paolo Ferragina Dipartimento di Informatica Università di Pisa
[Page 1]
IR
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
[Page 2]
Paradigm shift:
Web 2.0 is about the many
[Page 3]
Do big DATA need big PCs ??
an Italian ad from the '80s about a BIG brush or a brush BIG...
[Page 4]
big DATA ⇒ big PC ?
We have three types of algorithms: T1(n) = n, T2(n) = n², T3(n) = 2ⁿ
... and assume that 1 step = 1 time unit
How many input data n may each algorithm process within t time units?
n1 = t, n2 = √t, n3 = log2 t
What about a k-times faster processor? ...or, what is n when the available time is k·t ?
n1 = k·t, n2 = √k · √t, n3 = log2 (k·t) = log2 k + log2 t
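To make the point concrete, here is a minimal Python sketch (the function name `max_input` is mine) computing the largest n each algorithm can handle within t steps, and what a k-times faster processor buys:

```python
import math

def max_input(time_units):
    """Largest n each algorithm handles in `time_units` steps,
    for T1(n) = n, T2(n) = n^2, T3(n) = 2^n."""
    n1 = time_units                    # linear: n = t
    n2 = math.isqrt(time_units)        # quadratic: n = sqrt(t)
    n3 = int(math.log2(time_units))    # exponential: n = log2(t)
    return n1, n2, n3

t = 10**6
print(max_input(t))        # same time budget
print(max_input(100 * t))  # a 100x faster processor
```

With 100x more time the linear algorithm handles 100x more data, the quadratic one 10x more, and the exponential one only about 7 more items: hardware alone does not tame bad asymptotics.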
[Page 5]
A new scenario for Algorithmics
Data are more available than ever before
n ➜ ∞ ... is more than a theoretical assumption
The RAM model is too simple
Step cost is Θ(1)
[Page 6]
The memory hierarchy
CPU registers → L1/L2 cache → RAM → HD → network

| Level | Size | Access time | Transfer unit |
| --- | --- | --- | --- |
| L1/L2 cache | few MBs | some nanosecs | few words fetched |
| RAM | few GBs | tens of nanosecs | some words fetched |
| HD | few TBs | few millisecs | page of B = 32K |
| network | many TBs | even secs | packets |
[Page 7]
Does Virtual Memory help ?
M = memory size, N = problem size
p = prob. of memory access [0.3–0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵–10⁶ (Hennessy-Patterson)]
If N ≤ M, then the cost per step is 1
If N = (1+ε) M, then the avg cost per step is:
1 + C · p · ε/(1+ε)
This is at least > 10⁴ · ε/(1+ε)
If ε = 1/1000 (e.g. M = 1GB, N = 1GB + 1MB)
Avg step-cost is > 20
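A quick numeric check of the formula above, as a sketch (the function name and default parameters are mine, taken from the Hennessy-Patterson ranges on the slide):

```python
def avg_step_cost(M, N, p=0.3, C=10**5):
    """Average cost per step under virtual memory:
    if the problem fits (N <= M) each step costs 1;
    otherwise a fraction eps/(1+eps) of the data is on disk,
    and each of the p memory accesses per step pays C
    with that probability."""
    if N <= M:
        return 1.0
    eps = N / M - 1
    return 1 + C * p * eps / (1 + eps)

GB, MB = 2**30, 2**20
print(avg_step_cost(GB, GB))        # fits in memory: cost 1
print(avg_step_cost(GB, GB + MB))   # overflow by 1 MB: cost ~30
```

Even a 0.1% overflow of RAM multiplies the average step cost by an order of magnitude.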
[Page 8]
The I/O-model
Spatial locality or Temporal locality
[Figure: a disk, with tracks on a magnetic surface and a read/write arm and head]
"The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one's desk or by taking an airplane to the other side of the world and using a sharpener on someone else's desk." (D. Comer)
Fewer and faster I/Os ⇒ caching
Model: CPU + RAM + one disk (HD), data moved in pages of B items
Count I/Os
[Page 9]
Other issues ⇒ other models
Random vs sequential I/Os (scanning is better than jumping) ⇒ Streaming algorithms
Not just one CPU (many PCs, multi-core CPUs or even GPUs) ⇒ Parallel or Distributed algorithms
Parameter-free algorithms (anywhere, anytime, anyway... optimal !!) ⇒ Cache-oblivious algorithms
[Page 10]
[Page 11]
What about energy-consumption ?
[Leventhal, CACM 2008]
Hard disk: ≈10 IO/s per Watt
Flash (SSD): ≈6000 IO/s per Watt
[Page 12]
Our topics, on an example: the Web
[Diagram: a search-engine pipeline. A Crawler fetches the Web into a Page archive, deciding which pages to visit next; a Page Analyzer extracts text and structure; an Indexer builds the auxiliary index; at query time a Query resolver and a Ranker answer the user's query.]
Topics involved: Hashing, Data Compression, Dictionaries, Sorting, Linear Algebra, Clustering, Classification
[Page 13]
Warm up...
Take Wikipedia in Italian, and compute word frequencies:
Few GBs ⇒ n ≈ 10⁹ words
How do you proceed ?
Tokenize into a sequence of strings
Sort the strings
Create tuples ⟨word, freq⟩
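The three steps above can be sketched in Python (a toy in-memory version; the regex tokenizer is my assumption):

```python
import re

def word_freq(text):
    """Word frequencies via the slide's recipe: tokenize,
    sort, then count equal adjacent words in one scan."""
    tokens = re.findall(r"\w+", text.lower())   # 1. tokenize into strings
    tokens.sort()                               # 2. sort the strings
    pairs = []                                  # 3. <word, freq> tuples
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j] == tokens[i]:
            j += 1
        pairs.append((tokens[i], j - i))
        i = j
    return pairs

print(word_freq("la gatta e la rana"))
# [('e', 1), ('gatta', 1), ('la', 2), ('rana', 1)]
```

Sorting groups equal words together, so counting becomes a single linear scan; at Wikipedia scale the sort is the expensive step, which motivates the rest of the lecture.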
[Page 14]
Binary Merge-Sort

Merge-Sort(A,i,j)
01 if (i < j) then
02   m = (i+j)/2
03   Merge-Sort(A,i,m)
04   Merge-Sort(A,m+1,j)
05   Merge(A,i,m,j)

Divide, Conquer, Combine
Example: merging the sorted runs 1 2 8 10 and 7 9 13 19 (output so far: 1 2 7)
Merge is linear in the #items to be merged
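The pseudocode translates to Python as follows (a sketch for illustration; array bounds are inclusive, as on the slide):

```python
def merge_sort(A, i, j):
    """In-place binary merge-sort of A[i..j], mirroring the
    slide's pseudocode: divide, conquer, combine."""
    if i < j:
        m = (i + j) // 2
        merge_sort(A, i, m)
        merge_sort(A, m + 1, j)
        merge(A, i, m, j)

def merge(A, i, m, j):
    # Linear in the #items merged: one scan over both runs.
    left, right = A[i:m + 1], A[m + 1:j + 1]
    a = b = 0
    for k in range(i, j + 1):
        if b >= len(right) or (a < len(left) and left[a] <= right[b]):
            A[k] = left[a]; a += 1
        else:
            A[k] = right[b]; b += 1

A = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(A, 0, len(A) - 1)
print(A)  # [1, 2, 5, 7, 9, 10, 13, 19]
```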
[Page 15]
But...
Few key observations:
Items = (short) strings = atomic... Θ(n log n) memory accesses (I/Os ??)
[5ms] · n log2 n ≈ 3 years
In practice it is "faster", why?
[Page 16]
Implicit Caching...

Input:        10 2 | 5 1 | 13 19 | 9 7 | 15 4 | 8 3 | 12 17 | 6 11
Sorted pairs: 2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11
Runs of 4:    1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17
Runs of 8:    1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17
Final run:    1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19

Naively: log2 N merge levels, each costing 2 passes over the data (one Read/one Write) = 2 · (N/B) I/Os
Implicit caching: memory M ⇒ N/M runs, each sorted in internal memory (no I/Os); only the last log2 (N/M) merge levels do 2 passes (R/W) each
⇒ I/O-cost for binary merge-sort is ≈ 2 (N/B) log2 (N/M)
[Page 17]
A key inefficiency
Example: merging the runs 1 2 4 7 9 10 13 19 and 3 5 6 8 11 12 15 17, reading one page of B items from each run and flushing one output buffer of B items to the output run on disk (1, 2, 3 | 4, ...)
After a few steps, every run is longer than B !!!
We are using only 3 pages, but memory contains M/B pages ≈ 2³⁰/2¹⁵ = 2¹⁵
[Page 18]
Multi-way Merge-Sort
Sort N items with main memory M and disk pages of B items:
Pass 1: produce N/M sorted runs
Pass i: merge X = M/B − 1 runs ⇒ logX N/M passes
[Figure: main-memory buffers of B items, one page per run (run 1 ... run X) plus one output page, between the input disk and the output disk]
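One merge pass can be sketched with a heap standing in for the comparison among the X input buffers (here runs are plain in-memory lists standing in for disk runs; names are mine):

```python
import heapq

def multiway_merge(runs):
    """One pass of multi-way merge: repeatedly extract the
    smallest head among the X runs and append it to the
    output run. The heap holds one (value, run, position)
    triple per run, like one page per run in memory."""
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, r, i = heapq.heappop(heap)
        out.append(val)
        if i + 1 < len(runs[r]):           # refill from run r
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out

runs = [[1, 2, 5, 10], [7, 9, 13, 19], [3, 4, 8, 15]]
print(multiway_merge(runs))
# [1, 2, 3, 4, 5, 7, 8, 9, 10, 13, 15, 19]
```

With a heap each extraction costs O(log X), so one pass over N items costs O(N log X) work while still reading every page exactly once.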
[Page 19]
Cost of Multi-way Merge-Sort
Number of passes = logX N/M ≈ logM/B (N/M)
Total I/O-cost is Θ( (N/B) logM/B N/M ) I/Os
Large fan-out (M/B) decreases the #passes
In practice M/B ≈ 10⁵ ⇒ #passes = 1 ⇒ few mins
Tuning depends on disk features
Compression would decrease the cost of a pass!
Note: logM/B M = logM/B [(M/B)·B] = (logM/B B) + 1, so writing N/M or N/B inside the log changes the bound only by an additive constant
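A back-of-the-envelope calculator for the number of passes and total I/Os, following the slide's formulas (a sketch; the example sizes are hypothetical):

```python
import math

def io_cost(N, M, B):
    """Merge passes and total I/Os of multi-way merge-sort:
    pass 1 builds ceil(N/M) runs, then each pass merges
    X = M/B - 1 of them; every pass reads and writes the
    whole input, i.e. 2*(N/B) I/Os."""
    runs = math.ceil(N / M)            # sorted runs after pass 1
    X = M // B - 1                     # fan-out of each merge pass
    passes = math.ceil(math.log(runs, X)) if runs > 1 else 0
    ios = 2 * (N / B) * (1 + passes)   # run formation + merge passes
    return passes, ios

# Hypothetical sizes: N = 1 TB of items, M = 1 GB, B = 32 KB
print(io_cost(2**40, 2**30, 2**15))
```

With these sizes the fan-out M/B ≈ 2¹⁵ swallows the 2¹⁰ initial runs in a single merge pass, matching the slide's "#passes = 1" claim.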
[Page 20]
I/O-lower bound for Sorting
Every I/O fetches B items, into a memory of size M
Decision tree with fan-out (M choose B): the possible placements of the B fetched items among the M in memory
There are N/B steps in which the B! orderings of the fetched items also count as cmp-outcomes
Find t > N/B such that: (M choose B)ᵗ · (B!)^(N/B) ≥ N!
We get t = Ω( (N/B) logM/B N/B ) I/Os
[Page 21]
Keep attention...
If sorting needs to manage arbitrarily long strings
Key observations (Indirect sort):
Array A is an "array of pointers to objects" into the memory containing the strings
For each object-to-object comparison A[i] vs A[j]: 2 random accesses to the 2 memory locations pointed by A[i] and A[j]
Θ(n log n) random memory accesses (I/Os ??)
Again caching helps, but it may be less effective than before
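The indirect-sort idea in a minimal Python sketch (the sample strings are mine): an index array plays the role of A, and every comparison dereferences two locations of the string memory:

```python
# Indirect sort: sort pointers (indices), not the strings
# themselves. Each comparison A[i] vs A[j] reads the two
# strings they point to: a random access per operand.
strings = ["pisa", "informatica", "ferragina", "algorithms"]

A = list(range(len(strings)))        # "array of pointers to objects"
A.sort(key=lambda i: strings[i])     # dereference on every comparison

print([strings[i] for i in A])
# ['algorithms', 'ferragina', 'informatica', 'pisa']
```

Only the small index array moves; the long strings stay in place, but the Θ(n log n) dereferences scatter over the string memory, which is why caching helps less here than for the contiguous runs of merge-sort.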