1
Processing Queries with Bit-Vector Indexes
Originally presented by Anand Deshpande
2
Motivation
Consider the following SQL query:
SELECT name, address
FROM students
WHERE Dept = 'CS'
AND Hostel = 'H2'
To process this query
– use a complete scan
– use an index
3
Example students table
[Table: 14 student records with columns RID, Name, Hostel, Gender, Dept. Values by column:]
RID: 1 to 14
Name: Abhay Athavale, Bina Bajaj, Chinmay Chatterjee, David DeMillo, Era Edke, Frank Fernandez, Gauri Gaikwad, Hari Hate, Indira Irani, Jaya Joshi, Kader Khan, Leo Lobo, Meera Malik, Naresh Naik
Hostel: H1, H2, H3, H4
Gender: M, F
Dept: AE, CS, EE, ME
4
Using an Index to Process Queries
Find all records (rids) that match
– Dept = 'CS'
– Hostel = 'H2'
Intersect the two sets of rids
Given the rid, get the name and address
In the presence of an index
– FIND is of logarithmic order, O(log N)
– intersecting two sorted rid lists is O(n1 + n2)
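The O(n1 + n2) intersection of sorted rid lists can be sketched as a merge-style walk over both lists. This is a minimal illustration, not the presenter's implementation; the function name `intersect_sorted` is made up here, and the rid lists are the ones from the students example.

```python
def intersect_sorted(a, b):
    """Intersect two sorted rid lists in O(len(a) + len(b)) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])   # rid present in both lists
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1             # advance the list with the smaller rid
        else:
            j += 1
    return out

dept_cs = [1, 5, 7, 8, 12, 14]    # rids with Dept = 'CS'
hostel_h2 = [6, 8, 11, 12]        # rids with Hostel = 'H2'
result = intersect_sorted(dept_cs, hostel_h2)
print(result)
```

Each step advances at least one of the two cursors, which is where the n1 + n2 bound comes from.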
5
Processing Queries
[Figure: the example students table, repeated]
Dept = CS: {1, 5, 7, 8, 12, 14}
Hostel = H2: {6, 8, 11, 12}
Dept = CS AND Hostel = H2: {8, 12}
6
What does an Index do?
An index provides a mapping from a value to a set of records (RIDs)
Given a value, it tells which records have that value
Various kinds of indices
– B-Tree
– Hash Index
– R-Tree
7
B-Tree Index
[Figure: B-Tree index for Department; each key points to a list of RIDs]
AE: {4, 6}
CS: {1, 5, 7, 8, 12, 14}
EE: {2, 9, 11}
ME: {3, 10, 13}
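The value-to-rid-list mapping can be sketched with a sorted key array and binary search. This is only a stand-in for a real B-Tree (it has the same O(log n) lookup but none of the node structure); the names `keys`, `rid_lists` and `find` are hypothetical, and the data is the Department index shown on this slide.

```python
from bisect import bisect_left

# Sorted keys and, in the same order, the rid list for each key.
keys = ["AE", "CS", "EE", "ME"]
rid_lists = [[4, 6], [1, 5, 7, 8, 12, 14], [2, 9, 11], [3, 10, 13]]

def find(value):
    """Return the rid list for value, or an empty list if it is not indexed."""
    pos = bisect_left(keys, value)          # binary search over the keys
    if pos < len(keys) and keys[pos] == value:
        return rid_lists[pos]
    return []

print(find("CS"))
print(find("XX"))
```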
8
Index Architecture
[Figure: the students table (RID, Hostel, Gender, Dept) with two value-based indexes (B-Trees), each pointing to lists of RIDs]
9
Selectivity of Domains
A domain is strongly selective if the number of rids per value is small
– example: primary key
– only 1 rid for each value
A domain is weakly selective if the number of rids per value is large
– example: gender (male/female)
– about 0.5 * table_size rids for each value
10
Motivating Bit-Vectors
In queries with constraints on many weakly selective domains, rid intersection costs dominate the cost equations.
AND/OR-ing bit-vectors is an efficient alternative to intersecting or unioning sets of rids
11
Bit-Vector Representation of Sets
RID  Score  V7 V6 V5 V4 V3 V2 V1 V0
 1     5     0  0  1  0  0  0  0  0
 2     3     0  0  0  0  1  0  0  0
 3     7     1  0  0  0  0  0  0  0
 4     4     0  0  0  1  0  0  0  0
 5     3     0  0  0  0  1  0  0  0
 6     2     0  0  0  0  0  1  0  0
 7     4     0  0  0  1  0  0  0  0
 8     6     0  1  0  0  0  0  0  0
 9     2     0  0  0  0  0  1  0  0
10     0     0  0  0  0  0  0  0  1
11     5     0  0  1  0  0  0  0  0
12     4     0  0  0  1  0  0  0  0
Sets of rids per score
0 -- {10}
1 -- {}
2 -- {6, 9}
3 -- {2, 5}
4 -- {4, 7, 12}
5 -- {1, 11}
6 -- {8}
7 -- {3}
Score between 4 and 6 -- S4 U S5 U S6, computed as V4 or V5 or V6
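A minimal sketch of this representation, assuming Python integers as bit-vectors with bit (rid - 1) standing for each rid. The helper names `equality_vectors` and `rids_of` are invented for the illustration; the scores are the twelve values from the slide.

```python
scores = [5, 3, 7, 4, 3, 2, 4, 6, 2, 0, 5, 4]   # score of rid 1..12

def equality_vectors(values, domain):
    """One bit-vector per domain value; bit i is set when values[i] == value."""
    vectors = {v: 0 for v in domain}
    for i, value in enumerate(values):
        vectors[value] |= 1 << i
    return vectors

def rids_of(vector):
    """List the 1-based rids whose bit is set in a non-negative vector."""
    return [i + 1 for i in range(vector.bit_length()) if (vector >> i) & 1]

V = equality_vectors(scores, range(8))

# Score between 4 and 6: the union S4 U S5 U S6 becomes a bitwise OR.
between_4_and_6 = rids_of(V[4] | V[5] | V[6])
print(between_4_and_6)
```

A range over k values costs k - 1 OR operations with this encoding, which is what the range-encoded variant tries to reduce.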
12
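Range Encoded Bit-Vectors
Vk has a 1 for every rid whose score is <= k
RID  Score  V7 V6 V5 V4 V3 V2 V1 V0
 1     5     1  1  1  0  0  0  0  0
 2     3     1  1  1  1  1  0  0  0
 3     7     1  0  0  0  0  0  0  0
 4     4     1  1  1  1  0  0  0  0
 5     3     1  1  1  1  1  0  0  0
 6     2     1  1  1  1  1  1  0  0
 7     4     1  1  1  1  0  0  0  0
 8     6     1  1  0  0  0  0  0  0
 9     2     1  1  1  1  1  1  0  0
10     0     1  1  1  1  1  1  1  1
11     5     1  1  1  0  0  0  0  0
12     4     1  1  1  1  0  0  0  0
Score = 4 -- V4 and not V3
Score <= 4 -- V4
Score >= 2 and Score <= 4 -- V4 and not V1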
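A sketch of range encoding, assuming each vector Vk marks the rows with score <= k, so that Score = 4 is V4 AND NOT V3 and a two-sided range needs only two vectors. The helpers `range_vectors` and `rids_of` are hypothetical names; the scores are the same twelve values as on the previous slide.

```python
scores = [5, 3, 7, 4, 3, 2, 4, 6, 2, 0, 5, 4]   # score of rid 1..12

def range_vectors(values, max_value):
    """R[k] has bit i set when values[i] <= k."""
    vectors = [0] * (max_value + 1)
    for i, value in enumerate(values):
        for k in range(value, max_value + 1):
            vectors[k] |= 1 << i
    return vectors

def rids_of(vector):
    """List the 1-based rids whose bit is set in a non-negative vector."""
    return [i + 1 for i in range(vector.bit_length()) if (vector >> i) & 1]

R = range_vectors(scores, 7)

equal_4 = rids_of(R[4] & ~R[3])        # Score = 4
at_most_4 = rids_of(R[4])              # Score <= 4
from_2_to_4 = rids_of(R[4] & ~R[1])    # Score >= 2 and Score <= 4
print(equal_4, at_most_4, from_2_to_4)
```

Any range predicate touches at most two vectors here, at the price of denser vectors than the equality encoding.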
13
Bit-Vectors
[Figure: the students table (RID, Hostel, Gender, Dept) with one bit-vector per distinct value of each column]
Dept: AE, CS, EE, ME
Hostel: H1, H2, H3, H4
Gender: M, F
14
Merging RecordIds
SELECT name, address
FROM students
WHERE Dept = 'CS'
AND Hostel = 'H2'
What is better?
– Bit-wise AND or intersection
Depends on how many records are in each set
– if the sets are very small, record-id intersection will be faster than bit-wise AND
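The trade-off can be made concrete with a deliberately crude cost model: intersection touches about n1 + n2 rids, while a bit-wise AND touches every word of the vectors regardless of how many bits are set. The function `cheaper_strategy` and its cost units are assumptions for illustration only; a real optimizer would also account for compression, conversion costs and constant factors.

```python
def cheaper_strategy(n1, n2, n_rows, word_bits=32):
    """Pick a strategy by comparing rough operation counts."""
    intersect_cost = n1 + n2                  # merge over two sorted rid lists
    and_cost = -(-n_rows // word_bits)        # ceil(n_rows / word_bits) word ANDs
    return "intersect" if intersect_cost < and_cost else "bitwise-and"

# Two tiny sets in a 1M-row table: the rid lists are far shorter than the vectors.
print(cheaper_strategy(10, 20, 1_000_000))
# Two weakly selective predicates: the vectors win.
print(cheaper_strategy(500_000, 500_000, 1_000_000))
```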
15
Processing Queries
[Figure: three ways to combine Dept = CS and Hostel = H2]
– record-set with record-set, giving a record-set
– bit-vector AND (•) bit-vector, giving a bit-vector
– record-set converted to a bit-vector, then AND (•) with the other bit-vector
16
Processing Queries
Convert from bit-vector to record-ids and vice versa
For a record-id, probe into the bit-vector
Fast counting of bits to get counts -- extend to sum and average
Skip empty blocks
Deal with NULL values
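These operations can be sketched over a bit-vector stored as a list of 32-bit words. The layout (bit `(rid - 1) % 32` of word `(rid - 1) // 32`) and all function names are assumptions made for the example, not something the slides specify.

```python
WORD_BITS = 32

def rids_to_words(rids, n_rows):
    """Convert a list of 1-based rids into a list of 32-bit words."""
    words = [0] * ((n_rows + WORD_BITS - 1) // WORD_BITS)
    for rid in rids:
        idx = rid - 1
        words[idx // WORD_BITS] |= 1 << (idx % WORD_BITS)
    return words

def words_to_rids(words):
    """Convert back to rids, skipping words that are entirely zero."""
    rids = []
    for w_index, word in enumerate(words):
        if word == 0:
            continue                      # skip empty block
        for bit in range(WORD_BITS):
            if word & (1 << bit):
                rids.append(w_index * WORD_BITS + bit + 1)
    return rids

def probe(words, rid):
    """Test whether a single rid is in the set."""
    idx = rid - 1
    return bool((words[idx // WORD_BITS] >> (idx % WORD_BITS)) & 1)

def count(words):
    """Count set bits, which is the number of matching records."""
    return sum(bin(w).count("1") for w in words)

words = rids_to_words([8, 12, 40], n_rows=64)
print(words_to_rids(words), probe(words, 12), probe(words, 13), count(words))
```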
17
N-way AND and ORs
[Figure: an N-way AND (•) of Dept = CS and Hostel = H2 with an N-way OR (+) of Age = 19, Age = 20, Age = 21 and Age = 22]
Early Exit Strategy
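One plausible reading of the early exit strategy is that an N-way AND can stop as soon as the running result has no bits left, since further operands cannot add any. The sketch below assumes that reading; `and_all`, `or_all` and the 4-row vectors are hypothetical.

```python
from functools import reduce
from operator import or_

def and_all(vectors):
    """AND the operands left to right; stop once the result is empty."""
    vectors = iter(vectors)
    result = next(vectors)          # assumes at least one operand
    for v in vectors:
        result &= v
        if result == 0:
            break                   # early exit: remaining operands are not read
    return result

def or_all(vectors):
    """OR all operands together."""
    return reduce(or_, vectors, 0)

# Hypothetical 4-row vectors for the predicates in the figure.
dept = 0b1010
hostel = 0b0110
ages = [0b0001, 0b0010, 0b0100, 0b1000]   # Age = 19, 20, 21, 22

answer = and_all([dept, hostel, or_all(ages)])
print(bin(answer))
```

Ordering the most selective operands first makes the early exit more likely to trigger.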
18
Where are Bit-Vectors good?
Equality predicate
– select * from customer where state = 'CA'
AND predicates
– select * from customer where state = 'CA' and gender = 'F'
OR predicates
– select * from customer where state = 'CA' or state is NULL
Queries with negation
– select * from customer where state <> 'CA' and age between 30 and 40
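The negation and NULL cases deserve care: a plain bitwise NOT of the 'CA' vector would also select rows where state is NULL, which SQL's `<>` does not match. A small sketch, assuming an 8-row table, a separate bit-vector for NULL states, and made-up bit patterns:

```python
N_ROWS = 8
ALL_ROWS = (1 << N_ROWS) - 1      # mask with one bit per row

# Hypothetical bit-vectors over 8 customer rows (bit 0 is row 1).
state_ca = 0b00010110
state_null = 0b01000000
age_30_40 = 0b10011100

# state = 'CA' or state is NULL
ca_or_null = state_ca | state_null

# state <> 'CA': complement within the table, then drop the NULL rows
not_ca = ALL_ROWS & ~state_ca & ~state_null

# state <> 'CA' and age between 30 and 40
result = not_ca & age_30_40
print(bin(ca_or_null), bin(not_ca), bin(result))
```

Masking with `ALL_ROWS` keeps the complement from setting bits beyond the last row.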
19
Aggregate Queries
select count(*) from customer
select count(age) from customer where state = 'CA'
select state, count(*) from customer group by state
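Counts like these can be answered from the bit-vectors alone by counting set bits, without fetching any rows. The sketch assumes one vector per state and a vector marking non-NULL ages; the data and names are hypothetical.

```python
def popcount(v):
    """Number of set bits in a non-negative integer."""
    return bin(v).count("1")

# Hypothetical vectors over 8 customer rows (bit 0 is row 1).
state_vectors = {"CA": 0b00010110, "NY": 0b11101001}
age_not_null = 0b11111011          # row 3 has a NULL age

# select count(*) from customer (every row has exactly one state here)
count_all = popcount(state_vectors["CA"] | state_vectors["NY"])

# select count(age) from customer where state = 'CA'
count_age_ca = popcount(state_vectors["CA"] & age_not_null)

# select state, count(*) from customer group by state
group_counts = {state: popcount(v) for state, v in state_vectors.items()}
print(count_all, count_age_ca, group_counts)
```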
21
Bit-Vector Indices
The structure that maps value to record-id is the same
The Record List Area stores bit-vectors rather than record lists
22
Comparing Space Requirements
Consider a table with N (1M) records
Consider an index on a domain with n (100) values
The value-based index is identical in both cases
23
Calculating Space
Record-Id
– 1 million record ids
– 32 bits * 1M records
– 32M bits
– 1M words
– N words
Per value
– 100 values, (1,000,000 / 100 = 10,000) rids per value
Bit-Vectors
– 1 million bits per value
– 100 values
– 100 * 1,000,000 bits = 100M bits
– ~3M words
– N * n / 32 words
n -- number of distinct values
N -- number of records
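The arithmetic above can be checked in a few lines, assuming 32-bit words and one word per rid as on the slide. The function names are invented for the example.

```python
WORD_BITS = 32

def rid_list_words(n_records):
    """Record lists store one 32-bit rid per record: N words in total."""
    return n_records

def bit_vector_words(n_records, n_values):
    """Uncompressed bit-vectors store N bits per value: N * n / 32 words."""
    return n_records * n_values // WORD_BITS

N, n = 1_000_000, 100
print(rid_list_words(N))          # 1,000,000 words
print(bit_vector_words(N, n))     # 3,125,000 words, the "~3M" on the slide

# The two representations cost the same when n equals the word size.
print(bit_vector_words(N, 32) == rid_list_words(N))
```

Below 32 distinct values the uncompressed bit-vectors are the smaller structure; above that they need compression to compete.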
24
Can we do better?
For small domains (< 32 values), bit-vectors are space efficient
For large domains, bit-vectors are sparse
For very large domains, record-ids are the best compression
small domains -- bit-vectors, medium domains -- compress, large domains -- record-ids
25
Handling Skew
In many domains, a large portion of the records hold one of a few distinct values
Even though the number of unique values is large, some domains are candidates for bit-vectors
Compression of bits to reduce space
Dynamic selection of encoding strategy
26
Compression
Compressing bit-streams of 1s
run-length encoding
– 111100001111 (1:4:9:4)
– works well with large runs
– for very large blocks of zeros, don't store anything
must deal with runs as "one" object
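A sketch of run-length encoding for the 1s, assuming the (1:4:9:4) notation lists (start, length) pairs with 1-based positions: a run of four 1s starting at position 1 and another starting at position 9. Zeros are never stored. The function names are hypothetical.

```python
def encode_runs(bits):
    """Return (start, length) pairs for each run of 1s; positions are 1-based."""
    runs = []
    start = None
    for pos, bit in enumerate(bits, start=1):
        if bit == "1" and start is None:
            start = pos                          # a run of 1s begins
        elif bit == "0" and start is not None:
            runs.append((start, pos - start))    # the run ended at pos - 1
            start = None
    if start is not None:                        # run reaches the end of the stream
        runs.append((start, len(bits) - start + 1))
    return runs

def decode_runs(runs, length):
    """Rebuild the bit string from its runs of 1s."""
    bits = ["0"] * length
    for start, run_len in runs:
        bits[start - 1:start - 1 + run_len] = ["1"] * run_len
    return "".join(bits)

runs = encode_runs("111100001111")
print(runs)
print(decode_runs(runs, 12))
```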
27
Inserts/Deletes/Updates
Delete and insert may require toggling a bit
However, if the number of rows increases, each bitmap needs to be extended
Don't map bits to rows but to blocks
– shrinks the size of the bit-vector, and more bits are set, so better compression is possible
– does not do precise computation; can't deal with NOT, NULL etc.
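The loss of precision with block-level bits can be seen in a small sketch. It assumes 4 rows per block (an arbitrary choice) and reuses the rid sets from the students example; a set bit means "some row in this block matches", so an AND yields candidate blocks that still have to be rechecked, and a NOT is not meaningful.

```python
ROWS_PER_BLOCK = 4    # hypothetical block size
N_BLOCKS = 4          # 14 rows fit in 4 blocks of 4 rows

def block_vector(rids, rows_per_block=ROWS_PER_BLOCK):
    """Set bit b when at least one rid falls in block b (0-based)."""
    v = 0
    for rid in rids:
        v |= 1 << ((rid - 1) // rows_per_block)
    return v

dept_cs = block_vector([1, 5, 7, 8, 12, 14])
hostel_h2 = block_vector([6, 8, 11, 12])

# Blocks that may contain a row matching both predicates.
candidates = dept_cs & hostel_h2

# NOT is imprecise: every block holds some CS row, so the complement is empty
# even though non-CS rows exist in each block.
not_cs = ~dept_cs & ((1 << N_BLOCKS) - 1)
print(bin(dept_cs), bin(hostel_h2), bin(candidates), bin(not_cs))
```

Here the candidate blocks cover rids 5 to 12, while the exact answer is only {8, 12}.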