IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.


IR

Paolo Ferragina, Dipartimento di Informatica

Università di Pisa

Paradigm shift:

Web 2.0 is about the many

Do big DATA need big PCs ??

(an Italian ad of the '80s about a BIG brush or a brush BIG...)

big DATA big PC ?

We have three types of algorithms: T1(n) = n, T2(n) = n², T3(n) = 2ⁿ

... and assume that 1 step = 1 time unit

How many input data n each algorithm may process within t time units?

n1 = t, n2 = √t, n3 = log2 t

What about a k-times faster processor? ...or, what is n, when the available time is k*t ?

n1 = k * t, n2 = √k * √t, n3 = log2 (kt) = log2 k + log2 t
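These bounds can be checked numerically. A minimal sketch (the time budget t and speed-up factor k below are illustrative values, not from the slides):

```python
import math

def max_input(t: float) -> dict:
    """Largest input n each algorithm can process within t time units,
    assuming 1 step = 1 time unit."""
    return {
        "T1(n)=n":   t,             # n1 = t
        "T2(n)=n^2": math.sqrt(t),  # n2 = sqrt(t)
        "T3(n)=2^n": math.log2(t),  # n3 = log2(t)
    }

t, k = 10**6, 100  # illustrative time budget and speed-up factor
before, after = max_input(t), max_input(k * t)

# A k-times faster processor multiplies n1 by k, multiplies n2 only
# by sqrt(k), and merely ADDS log2(k) to n3.
print(after["T1(n)=n"]   / before["T1(n)=n"])    # k = 100
print(after["T2(n)=n^2"] / before["T2(n)=n^2"])  # sqrt(k) = 10
print(after["T3(n)=2^n"] - before["T3(n)=2^n"])  # log2(k) ≈ 6.64
```

The exponential algorithm barely benefits from faster hardware: that is the point of the slide.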

A new scenario for Algorithmics

Data are more available than ever before

n ➜ ∞ ... is more than a theoretical assumption

The RAM model is too simple

Step cost is Θ(1)

The memory hierarchy

CPU registers

Cache (L1/L2): few MBs, some nanosecs, few words fetched

RAM: few GBs, tens of nanosecs, some words fetched

HD: few TBs, few millisecs, B = 32K page fetched

net: many TBs, even secs, packets fetched

Does Virtual Memory help ?

M = memory size, N = problem size
p = prob. of a memory access [0.3 ÷ 0.4 (Hennessy-Patterson)]
C = cost of an I/O [10⁵ ÷ 10⁶ (Hennessy-Patterson)]

If N ≤ M, then the cost per step is 1

If N = (1+ε) M, then the avg cost per step is:

1 + C * p * ε/(1+ε)

This is at least > 10⁴ * ε/(1+ε)

If ε = 1/1000 ( e.g. M = 1Gb, N = 1Gb + 1Mb )

Avg step-cost is > 20
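The arithmetic above can be replayed directly; C and p span the Hennessy-Patterson ranges quoted on the slide, ε is the chosen overflow fraction:

```python
def avg_step_cost(C: float, p: float, eps: float) -> float:
    """Average cost per step when the problem exceeds memory by a
    fraction eps: a step is a memory access with probability p, and
    it faults (costing C) with probability ~ eps/(1+eps)."""
    return 1 + C * p * eps / (1 + eps)

eps = 1 / 1000  # e.g. M = 1Gb, N = 1Gb + 1Mb
lo = avg_step_cost(C=10**5, p=0.3, eps=eps)
hi = avg_step_cost(C=10**6, p=0.4, eps=eps)
print(lo, hi)  # roughly 31 .. 400: far above the in-memory cost of 1
```

Even a 0.1% overflow beyond RAM slows every step down by more than an order of magnitude: paging alone cannot rescue a memory-oblivious algorithm.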

The I/O-model

Spatial locality or Temporal locality

track

magnetic surface

read/write arm, read/write head

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

Fewer and faster I/Os ➜ caching

CPU RAM HD1

B

Count I/Os

Other issues ➜ other models

Random vs sequential I/Os

Scanning is better than jumping

Not just one CPU

Many PCs, Multi-cores CPUs or even GPUs

Parameter-free algorithms

Anywhere, anytime, anyway... Optimal !!

Streaming algorithms

Parallel or Distributed algorithms

Cache-oblivious algorithms

What about energy-consumption ?

[Leventhal, CACM 2008]

Disk: ≈10 IO/s/W

Flash: ≈6000 IO/s/W

Our topics, on an example: the Web

Crawler (which pages to visit next?) ➜ Page archive

Page Analyzer ➜ text, structure, auxiliary data

Indexer

Query ➜ Query resolver ➜ Ranker

Hashing, Data Compression, Dictionaries, Sorting, Linear Algebra, Clustering, Classification

Warm up...

Take Wikipedia in Italian, and compute word freq:

Few GBs ➜ n ≈ 10⁹ words

How do you proceed ??
Tokenize into a sequence of strings
Sort the strings
Create tuples < word, freq >
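For an in-memory corpus the three steps can be sketched as follows (the regex tokenizer here is deliberately crude; a real IR pipeline would normalize much further):

```python
import re
from itertools import groupby

def word_freq(text: str) -> list[tuple[str, int]]:
    # 1. Tokenize into a sequence of strings
    tokens = re.findall(r"\w+", text.lower())
    # 2. Sort the strings, so equal words become adjacent
    tokens.sort()
    # 3. Create tuples <word, freq> by grouping equal neighbours
    return [(w, sum(1 for _ in g)) for w, g in groupby(tokens)]

print(word_freq("la casa e la torre"))
# [('casa', 1), ('e', 1), ('la', 2), ('torre', 1)]
```

Sorting is the workhorse: once equal words are adjacent, one linear scan produces all the < word, freq > tuples. The question is how to sort when the tokens do not fit in RAM.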

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2
    Merge-Sort(A,i,m)
    Merge-Sort(A,m+1,j)
    Merge(A,i,m,j)

Divide, Conquer, Combine

Example: merging the sorted runs 1 2 8 10 and 7 9 13 19 starts by emitting 1 2 7 ...

Merge is linear in the #items to be merged
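A direct Python transcription of the pseudocode above, with the slide's inclusive i..j bounds:

```python
def merge_sort(A, i, j):
    if i < j:
        m = (i + j) // 2         # Divide
        merge_sort(A, i, m)      # Conquer left half
        merge_sort(A, m + 1, j)  # Conquer right half
        merge(A, i, m, j)        # Combine

def merge(A, i, m, j):
    """Merge the sorted runs A[i..m] and A[m+1..j]: linear in #items."""
    left, right = A[i:m + 1], A[m + 1:j + 1]
    a = b = 0
    for k in range(i, j + 1):
        if b >= len(right) or (a < len(left) and left[a] <= right[b]):
            A[k] = left[a]; a += 1
        else:
            A[k] = right[b]; b += 1

A = [10, 2, 5, 1, 13, 19, 9, 7]
merge_sort(A, 0, len(A) - 1)
print(A)  # [1, 2, 5, 7, 9, 10, 13, 19]
```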

But...

Few key observations:

Items = (short) strings = atomic... ➜ Θ(n log n) memory accesses (I/Os ??)

[5ms] * n log2 n ≈ 3 years

In practice it is faster, why?

Implicit Caching…

First pass: adjacent pairs are sorted in memory:

10 2 ➜ 2 10    5 1 ➜ 1 5    13 19 ➜ 13 19    9 7 ➜ 7 9
15 4 ➜ 4 15    8 3 ➜ 3 8    12 17 ➜ 12 17    6 11 ➜ 6 11

Successive merge passes:

1 2 5 10 | 7 9 13 19 | 3 4 8 15 | 6 11 12 17

1 2 5 7 9 10 13 19 | 3 4 6 8 11 12 15 17

1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19

log2 N passes in total

N/M runs, each sorted in internal memory (no I/Os)

Each merge pass costs 2 passes over the data (one Read/one Write) = 2 * (N/B) I/Os, and there are log2 (N/M) merge passes

➜ I/O-cost for binary merge-sort is ≈ 2 (N/B) log2 (N/M)

A key inefficiency

1 2 4 7 9 10 13 19 | 3 5 6 8 11 12 15 17

After few steps, every run is longer than B !!!

We are using only 3 pages (2 input + 1 output, of B items each), but memory contains M/B pages ≈ 2³⁰/2¹⁵ = 2¹⁵

Output buffer ➜ Disk: items 1, 2, 3 are flushed to the output run while 4, ... are still being produced

Multi-way Merge-Sort

Sort N items with main memory M and disk-pages of B items:

Pass 1: Produce (N/M) sorted runs.
Pass i: merge X = M/B - 1 runs ➜ logX (N/M) passes

Main-memory buffers of B items: one page per run (run 1, run 2, ..., run X) plus one output page, with the runs streamed from disk and the output streamed back to disk
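A minimal sketch of the two phases in Python, using in-memory lists as stand-in "disk runs" and heapq.merge as the X-way merger; run_size plays the role of M, and real code would stream pages of B items from files rather than keep the runs in RAM:

```python
import heapq

def multiway_merge_sort(items, run_size):
    """Phase 1: cut the input into runs of at most run_size items and
    sort each run 'in internal memory'.
    Phase 2: merge all runs at once with an X-way merge."""
    runs = [sorted(items[i:i + run_size])
            for i in range(0, len(items), run_size)]  # N/M sorted runs
    # heapq.merge keeps one 'page' (here: one item) per run plus the
    # output frontier, exactly the buffer layout sketched above
    return list(heapq.merge(*runs))

data = [10, 2, 5, 1, 13, 19, 9, 7, 15, 4, 8, 3, 12, 17, 6, 11]
print(multiway_merge_sort(data, run_size=4))
```

With fan-out at least #runs, one merge pass suffices, which is the whole point of going multi-way.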

Cost of Multi-way Merge-Sort

Number of passes = logX (N/M) ≈ logM/B (N/M)

Total I/O-cost is Θ( (N/B) logM/B (N/M) ) I/Os

Large fan-out (M/B) decreases the #passes

In practice M/B ≈ 10⁵ ➜ #passes = 1 ➜ few mins

Tuning depends on disk features

Compression would decrease the cost of a pass!

Side note: logM/B M = logM/B [(M/B) * B] = (logM/B B) + 1, so up to constant terms the bound can equivalently be written with logM/B (N/B)
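Plugging in concrete numbers shows why a single merge pass usually suffices; the values below (B = 32K pages, M = 1 GB, N = 1 TB) are illustrative, not from the slides:

```python
import math

def merge_passes(N, M, B):
    """Passes of (M/B - 1)-way merging needed over the N/M initial runs
    (fan-out must be >= 2)."""
    runs, fan_out = math.ceil(N / M), M // B - 1
    if runs <= 1:
        return 0  # everything fits in one in-memory run
    return max(1, math.ceil(math.log(runs, fan_out)))

B = 2**15  # 32K page
M = 2**30  # 1 GB of main memory -> fan-out ~ 2**15
N = 2**40  # 1 TB of data -> 2**10 initial runs
print(merge_passes(N, M, B))  # 1: a single merge pass after run formation
```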

I/O-lower bound for Sorting

Every I/O fetches B items into a memory of size M

Decision tree with fan-out (M choose B): each I/O can interleave the B fetched items with the M items in memory in at most (M choose B) ways

There are N/B input-reading steps, each contributing at most a factor B! of internal orderings

Find t > N/B such that: (M choose B)^t * (B!)^(N/B) ≥ N!

We get t = Ω( (N/B) logM/B (N/B) ) I/Os

Watch out...

If sorting needs to manage arbitrarily long strings

Key observations:

Array A is an “array of pointers to objects”

For each object-to-object comparison A[i] vs A[j]: 2 random accesses to the 2 memory locations pointed to by A[i] and A[j] ➜ Θ(n log n) random memory accesses (I/Os ??)

(A is the array of pointers; the strings themselves lie elsewhere in memory)

Again caching helps, but it may be less effective than before

Indirect sort
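Indirect sorting can be sketched as follows; the string area and its contents are illustrative, and cmp_to_key is used so that every comparison really dereferences two pointers, as in the access pattern described above:

```python
from functools import cmp_to_key

strings = ["pisa", "informatica", "ir", "ferragina"]  # the string area

# A is an array of "pointers" (indices) to the objects, not the objects
A = list(range(len(strings)))

def cmp(i, j):
    # every object-to-object comparison touches TWO arbitrary
    # locations of the string area: one random access each
    return (strings[i] > strings[j]) - (strings[i] < strings[j])

A.sort(key=cmp_to_key(cmp))     # Theta(n log n) such comparisons

print([strings[i] for i in A])  # strings in sorted order, data unmoved
```

Only the small pointer array moves; the variable-length strings stay put, but the comparisons jump unpredictably around memory, which is why caching helps less here.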