Indexing in DBMSs. Erik Selberg, 590db, 4/29/98.


Indexing in DBMSs

Erik Selberg, 590db, 4/29/98


Outline

Motivation
Cost Functions & 521
B+-Trees
  ISAM
Unstructured Text & IR
Conclusion


Motivation

Data stored on disk pages in one way
  O(n) space
Data can be ordered one way (if at all)
  O(log n) or O(1) lookup for one attribute
  O(n) lookup for the rest
Make lookups faster
  Increase space necessary
  What about speed of other operations?


Cost Functions

B: data pages on disk
R: records per page
  O(n) = O(BR)
D: I/O time (~25 ms)
C: CPU time (~1-10 ms)
H: Hash function time (~1-10 ms)


DBMS operations

Scan - fetch all records
Search w/ Equality
  Lookups and Modifications
Search w/ Range
Insert
Delete

Bulk operations may be amortized!


Baseline Storage

Unorganized (heap)
Sorted
  Sorted on one key
Hashed
  static hashing using chaining


Unorganized Heaps

Scan: BD + BRC
Search =: 1/2 (BD + BRC)
Search <>: BD + BRC
Insert: 2D + C
Delete: C + D

Challenge: make this worse


Sorted

Scan: BD + BRC
Search =: D lg B + C lg R
Search <>: D lg B + C lg R + #
Insert: (D lg B + C lg R) + (BD + BRC)
Delete: (D lg B + C lg R) + (BD + BRC)

Good for range, crappy for rest


Static Hash w/ Chaining

Scan: 1.25(BD + BRC)
Search =: H + D + 1/2RC
Search <>: 1.25(BD + BRC)
Insert: (H + D + 1/2RC) + (C + D)
Delete: (H + D + 1/2RC) + (C + D)

Need to grow and shrink hash table
Bad hashes hose you


Cost summary

File Type  Scan             Search =          Search <>             Insert            Delete
Heap       BD + BRC         ½(BD + BRC)       BD + BRC              2D + C            C + D
Sorted     BD + BRC         D lg B + C lg R   D lg B + C lg R + #   Srch + BD + BRC   Srch + BD + BRC
Hashed     1.25(BD + BRC)   H + D + ½RC       1.25(BD + BRC)        Srch + C + D      Srch + C + D
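
To get a feel for the cost summary, the formulas can be evaluated for concrete numbers. The following is a minimal sketch, not part of the slides: the function name `file_costs`, the sample values (B = 1024, R = 64, D = 25 ms, C = 1 ms, H = 1 ms), reading `lg` as log base 2, and modelling the `#` term as a caller-supplied `extra` cost are all assumptions.

```python
from math import log2

def file_costs(B, R, D=25.0, C=1.0, H=1.0, extra=0.0):
    """Return {file type: {operation: cost in ms}} using the summary formulas.

    extra stands in for the '#' term of a range search; how '#' is
    costed is an assumption of this sketch.
    """
    scan = B * D + B * R * C
    sorted_search = D * log2(B) + C * log2(R)
    hash_search = H + D + 0.5 * R * C
    return {
        "heap": {
            "scan": scan,
            "search_eq": 0.5 * scan,
            "search_range": scan,
            "insert": 2 * D + C,
            "delete": C + D,
        },
        "sorted": {
            "scan": scan,
            "search_eq": sorted_search,
            "search_range": sorted_search + extra,
            "insert": sorted_search + scan,
            "delete": sorted_search + scan,
        },
        "hashed": {
            "scan": 1.25 * scan,
            "search_eq": hash_search,
            "search_range": 1.25 * scan,
            "insert": hash_search + C + D,
            "delete": hash_search + C + D,
        },
    }

costs = file_costs(B=1024, R=64)
for name, ops in costs.items():
    print(name, {op: round(ms, 1) for op, ms in ops.items()})
```

With these sample values an equality search costs 58 ms on the hashed file, 256 ms on the sorted file, and 45568 ms on the heap, while a heap insert costs only 51 ms.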


What’s the best structure if:

You’re Amazon.com. Lots of equality lookups, some bulk insertions.
You’re United. Lots of range lookups.
You’re ESPN. Tons of insertions, range lookups. Equal lookups temporal.


What is stored in the index?

k: key; k*: data entry
r1 = (Malone, Karl, 123, 13, 4)
r2 = (Malone, Moses, 456, 16, 5)

k* = data            e.g. k* = r1
k* = <k, rid>        e.g. k* = <Malone, r1>
k* = <k, rid list>   e.g. k* = <Malone, (r1, r2)>
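
The three kinds of data entry can be written out for r1 and r2. This is a minimal sketch; the variable names (`records`, `as_records`, `by_rid`, `by_rid_list`) and the use of the strings "r1" and "r2" as record ids are assumptions made for illustration.

```python
from collections import defaultdict

# Records from the slide, keyed by record id (rid).
records = {
    "r1": ("Malone", "Karl", 123, 13, 4),
    "r2": ("Malone", "Moses", 456, 16, 5),
}

# k* = data: the index entries are the records themselves, kept in key order.
as_records = sorted(records.values())

# k* = <k, rid>: one entry per record, pointing at the record.
by_rid = [(rec[0], rid) for rid, rec in records.items()]

# k* = <k, rid list>: one entry per distinct key, with all matching rids.
by_rid_list = defaultdict(list)
for rid, rec in records.items():
    by_rid_list[rec[0]].append(rid)

print(as_records)
print(by_rid)
print(dict(by_rid_list))
```

The second form repeats the key once per record, while the third stores each key once and lets the rid list grow.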


Clustered Indices

Order data entries in a similar way to data records on disk
Only one clustered index per table

[Figure labels: Index, Data entries, Data Records]


Sparse and Dense Indices

Dense - one entry per record (1-1)
Sparse - one entry per page
  Clustered, therefore only one per table
Inverted on a field
  Dense secondary index
Fully Inverted
  All fields have index

[Figure: sparse index entries (Baker, Hawkins, Payton); data records (Baker, 4; Ellis, 14; Foster, 7; Hawkins, 9; Keefe, 5; Malone, 12; Payton, 7; Stockton, 13); dense index entries on the number (4, 5, 7, 7, 9, 12, 13, 14)]
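
The difference between the two kinds of index can be sketched with the name/number records from this slide. This is a hedged illustration only: the page boundaries, the (page, slot) form of the rid, and the function names `sparse_lookup` and `dense_lookup` are assumptions.

```python
from bisect import bisect_right

# Data pages sorted by name; the page boundaries are an assumption.
pages = [
    [("Baker", 4), ("Ellis", 14), ("Foster", 7)],
    [("Hawkins", 9), ("Keefe", 5), ("Malone", 12)],
    [("Payton", 7), ("Stockton", 13)],
]

# Sparse index: one entry per page (the first key on that page).
sparse = [page[0][0] for page in pages]

# Dense index on the number: one <k, rid> entry per record,
# where the rid is (page number, slot within the page).
dense = sorted(
    (num, (p, s))
    for p, page in enumerate(pages)
    for s, (_, num) in enumerate(page)
)

def sparse_lookup(name):
    """Find the page via the sparse index, then scan that one page."""
    p = bisect_right(sparse, name) - 1
    if p < 0:
        return None
    for rec in pages[p]:
        if rec[0] == name:
            return rec
    return None

def dense_lookup(num):
    """Return every record whose number equals num."""
    return [pages[p][s] for k, (p, s) in dense if k == num]

print(sparse_lookup("Malone"))
print(dense_lookup(7))
```

The sparse index only works because the pages are sorted by name; the dense index on the number needs an entry for every record because the records are not ordered by that field.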


Primary and Secondary Indices

Primary Index is over the Primary Key
Primary stores data entry as records
Primary has no duplicates
Should only be one

Secondary stores as <k, rid> or <k, rid list>


B-Trees

B is for Balanced (that’s good enough for me)
B-Tree
  Each node has d items, at most d+1 children
  Balanced tree
B+-Tree
  Data at leaves
  Leaves doubly-linked


A B+-Tree

[Figure: B+-tree with node keys 20, 40, 60, 80 and 6, 15, 30, 98, and leaf entries 1*, 2*, 3*, 6*, 9*, ..., 18*, 19*, 24*, 29*, 99*]

Keys are at leaves
Not all nodes / leaves are full
  Common impls keep 50% minimum occupancy
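
Because the data entries sit in linked leaves, a range search descends once and then walks sideways. The sketch below is a simplified illustration under stated assumptions: the class names `Leaf` and `Inner`, the hand-built two-level tree, and its keys are hypothetical (not the tree in the figure), and only the forward leaf link is modelled.

```python
from bisect import bisect_right

class Leaf:
    def __init__(self, keys):
        self.keys = keys      # sorted data entries
        self.next = None      # link to the right sibling

class Inner:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1

def find_leaf(node, key):
    """Descend from the root to the leaf that could hold key."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node

def range_search(root, lo, hi):
    """Return all keys k with lo <= k <= hi by walking the leaf chain."""
    leaf, out = find_leaf(root, lo), []
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return out
            if k >= lo:
                out.append(k)
        leaf = leaf.next
    return out

# A small hand-built tree with hypothetical keys.
leaves = [Leaf([1, 2, 3]), Leaf([6, 9]), Leaf([15, 18, 19]), Leaf([24, 29])]
for left, right in zip(leaves, leaves[1:]):
    left.next = right
root = Inner([6, 15, 20], leaves)

print(range_search(root, 3, 18))
```

Insertion and deletion, which need node splits and merges to keep the tree balanced, are left out of this sketch.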


B+-Tree Costs

Assume: d == R
Scan: BD + BRC
Search =: D lg B + C lg R
Search <>: D lg B + #
Insert: RCD lg B
Delete: RCD lg B

Some extra work to keep balance


Summary + B+-Tree costs

File Type  Scan             Search =          Search <>             Insert            Delete
Heap       BD + BRC         ½(BD + BRC)       BD + BRC              2D + C            C + D
Sorted     BD + BRC         D lg B + C lg R   D lg B + C lg R + #   Srch + BD + BRC   Srch + BD + BRC
Hashed     1.25(BD + BRC)   H + D + ½RC       1.25(BD + BRC)        Srch + C + D      Srch + C + D
B+-Tree    BD + BRC         D lg B + C lg R   D lg B + C lg R + #   RCD lg B          RCD lg B


ISAM Trees

Similar to B+-Tree
Not balanced, uses chaining
  Faster Insert / Delete, slower Search
Internal nodes are static
  Good for static DBs and data warehouses


Sparse and Clustered Indices

Remember that bit about only one clustered index per table?
  Only one clustered index per table
  Therefore, only one index has values that can be read sequentially without lots of page requests


How many locks do you need to...

Insert a new item into DB
  Unsorted?
  Sorted?
  Hash?
  B+-Tree?
  ISAM?


Unstructured Text

Database => structured data
  Schemas
  Tables
  OLTP
Information Retrieval => unstructured

So they don’t have much to do with one another, right?


IR Queries

Karl AND Malone
  SELECT Docs(D) WHERE “Karl” in D AND “Malone” in D

“Karl Malone”
  SELECT Docs(D) WHERE “Karl Malone” in D
  Does this mean “X Y” is a single term?

Karl NEAR/2 Malone
  SELECT Docs(D) WHERE …uh…?


Structuring Text

Position is structure!
  Karl: par 1, sen 1, word 4
  Malone: par 1, sen 1, word 5
          par 2, sen 1, word 2
          par 3, sen 1, word 7, zone quote

Admiral KO’d by Jazz power-forward; Malone fined and suspended.

SALT LAKE CITY -- Karl Malone has assured David Robinson the elbow blow that knocked Robinson unconscious was unintentional. Robinson doubts the blow was intended to hurt him, but is not certain.

Nevertheless, Malone on Friday was suspended without pay for one game and fined $5,000 by Rod Thorn, the NBA's senior VP of basketball operations, who normally deals with cases of discipline.

"While I do not believe that Malone intentionally elbowed Robinson, players have a responsibility not to recklessly swing their elbows in a manner that could cause injury to another player," Thorn said.

Malone missed Utah's game Friday night, but the Jazz didn't miss a beat without its leading scorer and routed the L.A. Clippers 127-99.

Meanwhile, Robinson sat out the Spurs' game with Seattle, but San Antonio overcame his absence to beat the SuperSonics 99-84.

The suspension forced Malone to miss just the fifth game of his 13-year career. He had played in 543 consecutive games -- the third-longest streak in the NBA and first for consecutive starts -- and had played in 844 of the Jazz's previous 845 games.
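
The positions listed for Karl and Malone on this slide can be recovered mechanically from the article text. The following is a minimal sketch under stated assumptions: it tracks only paragraph and word numbers (no sentence or zone detection), the tokenizer is a simple regular expression, the names `build_positions` and `near` are hypothetical, and NEAR/k is read as "within k words in the same paragraph".

```python
import re
from collections import defaultdict

def build_positions(paragraphs):
    """Map each word to a list of (paragraph number, word number), 1-based."""
    index = defaultdict(list)
    for p, text in enumerate(paragraphs, start=1):
        for w, word in enumerate(re.findall(r"[A-Za-z0-9']+", text), start=1):
            index[word].append((p, w))
    return index

def near(index, a, b, k):
    """True if a and b occur within k words of each other in one paragraph."""
    return any(
        pa == pb and abs(wa - wb) <= k
        for pa, wa in index.get(a, [])
        for pb, wb in index.get(b, [])
    )

# Openings of the article's first three paragraphs.
paragraphs = [
    "SALT LAKE CITY -- Karl Malone has assured David Robinson",
    "Nevertheless, Malone on Friday was suspended without pay",
    '"While I do not believe that Malone intentionally elbowed Robinson',
]
index = build_positions(paragraphs)

print(index["Karl"])                       # [(1, 4)]
print(index["Malone"])                     # [(1, 5), (2, 2), (3, 7)]
print(near(index, "Karl", "Malone", 2))    # True
```

Once positions are stored with each term, phrase and proximity queries reduce to comparisons on those positions.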


IR Queries in SQL

Query: “Karl Malone”, Robinson
Meaning: Docs w/ “Karl Malone” and Robinson

TextIndex(word: string, doc: int, pos: int)

SELECT W1.doc
FROM TextIndex W1, TextIndex W2, TextIndex W3
WHERE W1.doc = W2.doc AND W2.doc = W3.doc
  AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
  AND W2.pos = W1.pos + 1
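
The phrase query can be tried against an in-memory SQLite database. This is a sketch under stated assumptions: the three tiny documents are made up, whitespace splitting stands in for real tokenization, `DISTINCT` is added so each document appears once, and the adjacency condition is written so that "Malone" sits one position after "Karl".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TextIndex (word TEXT, doc INTEGER, pos INTEGER)")

# Hypothetical documents: only doc 1 has the phrase and "Robinson".
docs = {
    1: "Karl Malone elbowed Robinson",
    2: "Malone Karl Robinson",
    3: "Karl Malone missed one game",
}
for doc, text in docs.items():
    conn.executemany(
        "INSERT INTO TextIndex VALUES (?, ?, ?)",
        [(word, doc, pos) for pos, word in enumerate(text.split(), start=1)],
    )

query = """
    SELECT DISTINCT W1.doc
    FROM TextIndex W1, TextIndex W2, TextIndex W3
    WHERE W1.doc = W2.doc AND W2.doc = W3.doc
      AND W1.word = 'Karl' AND W2.word = 'Malone' AND W3.word = 'Robinson'
      AND W2.pos = W1.pos + 1
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)  # [1]
```

Document 2 contains all three words but in the order "Malone Karl", so the position condition rejects it; document 3 has the phrase but no "Robinson".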


Indexing Issues in IR

Index method: hash table on word
IR folks think about attributes
IR folks munge attributes
  elbow* => elbow, elbowing, elbowed, etc.
  “to be or not to be” => “”
IR folks create search keys
  Malone => Malone, Stockton, Jazz, Sloan,
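
The two kinds of munging shown on this slide, wildcard expansion and stop-word removal, can be sketched in a few lines. This is only an illustration: the stop-word set, the vocabulary, and the function names `remove_stop_words` and `expand_prefix` are hypothetical, and real IR systems use far larger lists and stemming rules.

```python
# Hypothetical stop-word list and vocabulary, for illustration only.
STOP_WORDS = {"to", "be", "or", "not"}
VOCABULARY = ["elbow", "elbowed", "elbowing", "elbows", "else"]

def remove_stop_words(query):
    """Drop stop words; a query made only of stop words becomes empty."""
    return " ".join(w for w in query.lower().split() if w not in STOP_WORDS)

def expand_prefix(pattern, vocabulary=VOCABULARY):
    """Expand a trailing-* pattern to every vocabulary word with that prefix."""
    if not pattern.endswith("*"):
        return [pattern]
    prefix = pattern[:-1]
    return [w for w in vocabulary if w.startswith(prefix)]

print(repr(remove_stop_words("to be or not to be")))  # ''
print(expand_prefix("elbow*"))
```

The empty result for "to be or not to be" shows why stop-word removal is a trade-off: it shrinks the index but makes some queries unanswerable.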


IR and DBMSs

IR uses DBMS for low-level storage
  e.g. hash table storage
Hash table lookup is only first step
  Clustering
  Relevance Ranking
  Feedback, Expansion, ...
Full SQL not needed
  Custom optimized DB performs better


How AltaVista returns so quickly...

Hash indexes mean lots of page requests if there are lots of matches...

Trick #1: use memory.
Trick #2: threshold (find 10 pages > 75% rel).
Trick #3: hard time limit.
  More users, less CPU time / query
Trick #4: prioritize
  Try to find 10 in memory
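
Tricks #2 and #3 amount to stopping the search early. The sketch below shows the general idea only and makes no claim about how AltaVista was actually implemented: the function name `top_matches`, its parameters, and the (doc_id, relevance) candidate stream are assumptions.

```python
import time

def top_matches(candidates, want=10, min_rel=0.75, time_limit=0.05):
    """Collect up to `want` hits scoring above min_rel, within a time budget.

    candidates yields (doc_id, relevance) pairs, ideally in-memory hits first.
    """
    deadline = time.monotonic() + time_limit
    hits = []
    for doc_id, rel in candidates:
        if time.monotonic() > deadline:
            break  # Trick #3: hard time limit
        if rel > min_rel:
            hits.append(doc_id)
            if len(hits) == want:
                break  # Trick #2: enough good pages, stop early
    return hits

# Hypothetical candidate stream: even ids are relevant, odd ids are not.
candidates = [(i, 0.9 if i % 2 == 0 else 0.5) for i in range(100)]
print(top_matches(candidates, time_limit=10))
```

Feeding the in-memory candidates first (Trick #4) means the early exit often fires before any disk page is requested.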


Summary

Concerned about B, R, not just n
Hash for equality, B+-Tree for range
One index gives good disk performance
IR uses hash indexing
IR stores term information

Indexing helps performance, but you still need to think about what to index!