IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Date posted: 21-Dec-2015

Page 1: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

IR

Paolo Ferragina

Dipartimento di Informatica

Università di Pisa

Page 2: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Paradigm shift:

Web 2.0 is about the many

Page 3: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Do big DATA need big PCs ??

An Italian ad from the ’80s about a BIG brush or a brush BIG....

Page 4: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

big DATA ➜ big PC ?

We have three types of algorithms: $T_1(n) = n$, $T_2(n) = n^2$, $T_3(n) = 2^n$

... and assume that 1 step = 1 time unit

How many input items n can each algorithm process within t time units?

$n_1 = t$, $n_2 = \sqrt{t}$, $n_3 = \log_2 t$

What about a k-times faster processor? ...or, what is n, when the available time is k*t ?

$n_1 = k \cdot t$, $n_2 = \sqrt{k} \cdot \sqrt{t}$, $n_3 = \log_2 (kt) = \log_2 k + \log_2 t$
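A minimal sketch of the arithmetic above, assuming t is a positive integer number of steps; the function name `max_input_sizes` is hypothetical and results are rounded down to integers.

```python
import math

def max_input_sizes(t):
    """Largest n solvable within t steps for T1(n)=n, T2(n)=n^2, T3(n)=2^n."""
    n1 = t                      # n <= t
    n2 = math.isqrt(t)          # n^2 <= t  ->  n <= floor(sqrt(t))
    n3 = t.bit_length() - 1     # 2^n <= t  ->  n <= floor(log2 t), for t >= 1
    return n1, n2, n3

t, k = 1024, 4
base = max_input_sizes(t)        # time budget t
faster = max_input_sizes(k * t)  # k-times faster processor = budget k*t
print(base, faster)
```

With a 4-times faster processor the linear algorithm handles 4 times more data, the quadratic one only twice as much, and the exponential one just log2 4 = 2 more items.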

Page 5: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

A new scenario for Algorithmics

Data are more available than ever before

n ➜ ∞ ... is more than a theoretical assumption

The RAM model is too simple

Step cost is Ω(1)

Page 6: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

The memory hierarchy

[Figure: the memory hierarchy, from the CPU to the network]

CPU registers
L1/L2 cache: few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, B = 32K page
net: many TBs, even secs, packets

Page 7: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Does Virtual Memory help ?

M = memory size, N = problem size

p = prob. of memory access [0.3 ÷ 0.4 (Hennessy-Patterson)]

C = cost of an I/O [$10^5$ ÷ $10^6$ (Hennessy-Patterson)]

If N ≤ M, then the cost per step is 1

If N = (1+ε) M, then a fraction ε/(1+ε) of the data does not fit in memory, and the avg cost per step is:

$1 + C \cdot p \cdot \frac{\varepsilon}{1+\varepsilon}$

This is at least > $10^4 \cdot \frac{\varepsilon}{1+\varepsilon}$

If ε = 1/1000 (e.g. M = 1 GB, N = 1 GB + 1 MB)

Avg step-cost is > 20
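The estimate can be checked numerically. A small sketch, assuming the formula 1 + C * p * ε/(1+ε) and the lowest values of the quoted ranges (p = 0.3, C = 10^5); the function name `avg_step_cost` is hypothetical.

```python
def avg_step_cost(eps, p, io_cost):
    """Average cost of one step when N = (1 + eps) * M.

    eps / (1 + eps) is the fraction of data that lives on disk,
    p the probability that a step accesses memory,
    io_cost the cost C of one I/O, in units of one in-memory step.
    """
    return 1 + io_cost * p * eps / (1 + eps)

# M = 1 GB, N = 1 GB + 1 MB  ->  eps = 1/1000
cost = avg_step_cost(eps=1 / 1000, p=0.3, io_cost=10**5)
print(round(cost, 2))
```

Even with only 0.1% of the data on disk, the average step is tens of times slower than in the all-in-memory case.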

Page 8: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

The I/O-model

Spatial locality or Temporal locality

[Figure: hard disk, with labels: track, magnetic surface, read/write arm, read/write head]

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Fewer and faster I/Os ➜ caching

[Figure: CPU, RAM and HD; pages of size B]

Count I/Os

Page 9: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Other issues ➜ other models

Random vs sequential I/Os: scanning is better than jumping ➜ Streaming algorithms

Not just one CPU: many PCs, multi-core CPUs or even GPUs ➜ Parallel or Distributed algorithms

Parameter-free algorithms: anywhere, anytime, anyway... Optimal !! ➜ Cache-oblivious algorithms

Page 10: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
Page 11: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

What about energy-consumption ?

[Leventhal, CACM 2008]

≈10 IO/s/W

≈6000 IO/s/W

Page 12: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Our topics, on an example

[Figure: components of the example system: Web, Crawler ("Which pages to visit next?"), Page archive, Page Analyzer (text, structure, auxiliary), Indexer, Query, Query resolver, Ranker]

Topics: Hashing, Data Compression, Dictionaries, Sorting, Linear Algebra, Clustering, Classification

Page 13: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Warm up...

Take Wikipedia in Italian, and compute word freq:

Few GBs ➜ n ≈ $10^9$ words

How do you proceed ??
Tokenize into a sequence of strings
Sort the strings
Create tuples < word, freq >
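The three steps can be sketched in a few lines for a text that fits in memory. This is only an illustration: the tokenizer (lowercasing, then a `\w+` regular expression) and the name `word_frequencies` are assumptions, not part of the slides.

```python
import re
from itertools import groupby

def word_frequencies(text):
    """Tokenize, sort, then collapse equal neighbours into <word, freq> tuples."""
    tokens = re.findall(r"\w+", text.lower())   # 1. tokenize
    tokens.sort()                               # 2. sort the strings
    # 3. after sorting, equal words are adjacent: count each group
    return [(word, sum(1 for _ in group)) for word, group in groupby(tokens)]

freqs = word_frequencies("la casa e la strada")
print(freqs)
```

Sorting dominates the cost, which is why the rest of the deck focuses on sorting data that does not fit in memory.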

Page 14: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Binary Merge-Sort

Merge-Sort(A,i,j)
    if (i < j) then
        m = (i+j)/2;            // Divide
        Merge-Sort(A,i,m);      // Conquer
        Merge-Sort(A,m+1,j);    // Conquer
        Merge(A,i,m,j)          // Combine

[Figure: merging the sorted runs 1 2 8 10 and 7 9 13 19; output so far: 1 2 7]

Merge is linear in the #items to be merged
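A runnable sketch of the pseudocode above. For brevity it returns new lists instead of sorting A in place between indices i and j, so it is a variant of the slide's procedure rather than a literal transcription.

```python
def merge(left, right):
    """Merge two sorted lists in time linear in the number of items."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])    # at most one of these two tails is non-empty
    out.extend(right[j:])
    return out

def merge_sort(a):
    """Divide in the middle, sort both halves recursively, combine with merge."""
    if len(a) <= 1:
        return list(a)
    m = len(a) // 2
    return merge(merge_sort(a[:m]), merge_sort(a[m:]))

print(merge([1, 2, 8, 10], [7, 9, 13, 19]))
```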

Page 15: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

But...

Few key observations:

Items = (short) strings = atomic... Θ(n log n) memory accesses (I/Os ??)

[5 ms] * $n \log_2 n$ ≈ 3 years

In practice it is faster, why?

Page 16: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Implicit Caching…

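[Figure: the merge-sort recursion tree on 16 items, bottom-up]

Input pairs: 10 2 | 5 1 | 13 19 | 9 7 | 15 4 | 8 3 | 12 17 | 6 11

Sorted pairs: 2 10 | 1 5 | 13 19 | 7 9 | 4 15 | 3 8 | 12 17 | 6 11

Runs of 4: 1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17

Runs of 8: 1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17

Sorted: 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19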

The recursion tree has $\log_2 N$ levels

Runs of size ≤ M: N/M runs, each sorted in internal memory (no I/Os)

Each of the remaining $\log_2 (N/M)$ levels takes 2 passes (one Read/one Write) = 2 * (N/B) I/Os

I/O-cost for binary merge-sort is ≈ $2 (N/B) \log_2 (N/M)$
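A small sketch of the estimate 2 (N/B) log2 (N/M), with all sizes measured in items. The function name and the sample values of N, M and B are assumptions chosen only to make the arithmetic visible.

```python
def binary_mergesort_ios(n_items, mem_items, page_items):
    """I/Os of binary merge-sort: 2 * (N/B) per level, log2(N/M) levels."""
    pages = -(-n_items // page_items)     # ceil(N / B)
    runs = -(-n_items // mem_items)       # ceil(N / M) runs sorted in memory
    levels = (runs - 1).bit_length()      # ceil(log2(runs)), 0 if one run
    return 2 * pages * levels             # one read + one write per level

# Assumed sizes: N = 2^30 items, M = 2^20 items, B = 2^10 items per page
ios = binary_mergesort_ios(2**30, 2**20, 2**10)
print(ios)
```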

Page 17: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

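A key inefficiency

[Figure: two sorted runs, 1 2 4 7 9 10 13 19 and 3 5 6 8 11 12 15 17, merged through one page of size B per run plus an output buffer of size B (1, 2, 3) that is flushed to the output run on disk (1, 2, 3, 4, ...)]

After few steps, every run is longer than B !!!

We are using only 3 pages

But memory contains M/B pages ≈ $2^{30}/2^{15} = 2^{15}$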

Page 18: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Multi-way Merge-Sort

Sort N items with main-memory M and disk-pages B:

Pass 1: Produce (N/M) sorted runs.

Pass i: merge X = M/B − 1 runs ➜ $\log_X (N/M)$ passes

[Figure: X input runs on disk, each read through a main-memory buffer of B items (Pg for run 1, Pg for run 2, ..., Pg for run X), merged into one output page (Out Pg) that is written back to disk]
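The two kinds of passes can be simulated in memory. This is a sketch under loud assumptions: Python lists stand in for runs on disk, `heapq.merge` plays the role of the X-way merger, and `mem_items` and `fan_in` are hypothetical parameters for M and X; no real I/O or paging is modelled.

```python
import heapq

def multiway_merge_sort(items, mem_items, fan_in):
    """Pass 1 builds sorted runs of <= mem_items; later passes merge fan_in runs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Pass 1: each run fits in memory and is sorted there
    runs = [sorted(items[i:i + mem_items])
            for i in range(0, len(items), mem_items)]
    # Pass i: merge groups of fan_in runs until a single run is left
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

data = [10, 2, 5, 1, 13, 19, 9, 7, 15, 4, 8, 3, 12, 17, 6, 11]
print(multiway_merge_sort(data, mem_items=2, fan_in=3))
```

With 8 initial runs and fan-in 3, two merge passes suffice, against three for the binary version.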

Page 19: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Cost of Multi-way Merge-Sort

Number of passes = $\log_X (N/M) \approx \log_{M/B} (N/M)$

Total I/O-cost is $\Theta\left( (N/B) \log_{M/B} (N/M) \right)$ I/Os

Large fan-out (M/B) decreases #passes

In practice M/B ≈ $10^5$ ➜ #passes = 1 ➜ few mins

Tuning depends on disk features

Compression would decrease the cost of a pass (N/B)!

$\log_{M/B} M = \log_{M/B} [(M/B) \cdot B] = (\log_{M/B} B) + 1$
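To see how the fan-out affects the number of passes, here is a sketch that counts merge passes by repeatedly dividing the number of runs by the fan-in. The function name and the sample sizes (N = 10^9, M = 10^8, B = 10^3 items, so M/B = 10^5) are assumptions for illustration.

```python
def merge_passes(n_items, mem_items, fan_in):
    """Number of merge passes needed after the run-formation pass."""
    runs = -(-n_items // mem_items)       # ceil(N / M) initial sorted runs
    passes = 0
    while runs > 1:
        runs = -(-runs // fan_in)         # each pass merges fan_in runs into one
        passes += 1
    return passes

N, M, B = 10**9, 10**8, 10**3
multiway = merge_passes(N, M, M // B - 1)   # X = M/B - 1
binary = merge_passes(N, M, 2)
print(multiway, binary)
```

Each pass reads and writes all N/B pages, so cutting the number of passes cuts the I/O cost proportionally.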

Page 20: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

I/O-lower bound for Sorting

Every I/O fetches B items, in memory M

Decision tree with fan out: $\binom{M}{B}$

There are N/B steps with an extra factor of B! cmp-outcomes

Find t > N/B such that:

$\binom{M}{B}^{t} \cdot (B!)^{N/B} \geq N!$

We get $t = \Omega\left( (N/B) \log_{M/B} (N/B) \right)$ I/Os
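A sketch of how the bound follows, assuming the counting condition is $\binom{M}{B}^{t} (B!)^{N/B} \geq N!$ and using the standard estimates $\log \binom{M}{B} \leq B \log (eM/B)$, $\log B! \leq B \log B$ and $\log N! \geq N \log (N/e)$:

```latex
\begin{align*}
t \log \binom{M}{B} + \frac{N}{B} \log B! &\geq \log N! \\
t \cdot B \log \frac{eM}{B} + N \log B &\geq N \log \frac{N}{e} \\
t &\geq \frac{N}{B} \cdot \frac{\log \frac{N}{eB}}{\log \frac{eM}{B}}
   = \Omega\left( \frac{N}{B} \log_{M/B} \frac{N}{B} \right)
\end{align*}
```

The constants e only affect lower-order terms, so the bound matches the cost of multi-way merge-sort up to constant factors.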

Page 21: IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Pay attention...

If sorting needs to manage arbitrarily long strings

Indirect sort

[Figure: array A of pointers into the memory containing the strings]

Key observations:

Array A is an “array of pointers to objects”

For each object-to-object comparison A[i] vs A[j]: 2 random accesses to 2 memory locations A[i] and A[j]

Θ(n log n) random memory accesses (I/Os ??)

Again caching helps, but it may be less effective than before
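A minimal sketch of an indirect sort, with list indices standing in for pointers; the name `indirect_sort` is hypothetical. Every comparison dereferences two "pointers", which is where the random memory accesses come from.

```python
def indirect_sort(strings):
    """Return the array A of 'pointers' (indices) sorted by the strings they refer to."""
    pointers = list(range(len(strings)))
    # each key lookup jumps to wherever strings[i] is stored
    pointers.sort(key=lambda i: strings[i])
    return pointers

strings = ["pisa", "algoritmo", "web"]
order = indirect_sort(strings)
print(order, [strings[i] for i in order])
```

The strings themselves never move: only the small array of pointers is permuted.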