SASI, Cassandra on the full text search ride - DuyHai Doan - Codemotion Amsterdam 2016
@doanduyhai
SASI, Cassandra on the full text search ride
DuyHai DOAN Apache Cassandra Evangelist
AMSTERDAM 11-12 MAY 2016
Who Am I ?
Duy Hai DOAN, Apache Cassandra Evangelist
• talks, meetups, confs
• open-source devs (Achilles, Apache Zeppelin…)
• OSS Cassandra point of contact ☞ [email protected] ☞ @doanduyhai
Datastax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 450+ employees
• Headquarters in San Francisco Bay area
• EU headquarters in London, offices in France and Germany
• Datastax Enterprise = OSS Cassandra + extra features
SASI Index
• What is SASI ?
• Distributed Index
• Life-cycle
• Query Planner
What is SASI ?
Who ?
• Open source contribution by a team of engineers
How ?
New secondary index re-designed from scratch
• follows the SSTable life-cycle (flush, compaction)
• new data structures
• full text search options
• no dependency on Apache Lucene
SASI = SSTable-Attached Secondary Index
SASI Demo
Distributed Index
Index on user country
[Figure: ring of 8 nodes (A to H); sample index entries shown next to nodes]
FR → user1, user102, …, user493
US → user54, user483, …, user938
FR → user87, user176, …, user987
FR → user17, user409, …, user787
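The figure can be read as: every node indexes only the rows it stores itself. A minimal Python sketch of that idea follows; the `Node` class and the placement of users on nodes are assumptions made for illustration, not Cassandra code.

```python
from collections import defaultdict


class Node:
    """Hypothetical node that keeps a local index: country -> user ids."""

    def __init__(self, name):
        self.name = name
        self.rows = {}                         # user_id -> country
        self.country_index = defaultdict(set)  # country -> {user_id, ...}

    def insert(self, user_id, country):
        # The index entry is written on the same node as the row itself.
        self.rows[user_id] = country
        self.country_index[country].add(user_id)

    def search(self, country):
        # A local index lookup only sees the rows stored on this node.
        return sorted(self.country_index.get(country, ()))


# Placement of users on nodes is made up for the example.
node_a, node_b = Node("A"), Node("B")
node_a.insert("user1", "FR")
node_a.insert("user102", "FR")
node_b.insert("user54", "US")
node_b.insert("user87", "FR")

# Finding every French user means asking each node for its local matches.
french_users = sorted(u for n in (node_a, node_b) for u in n.search("FR"))
print(french_users)
```

Because no single node knows all the matches, a search on the indexed column has to be sent to several nodes.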
Distributed search query handling
[Figure: 8-node ring (A to H) with a coordinator]
• 1st round: concurrency factor = 1
• Not enough results ?
• 2nd round: concurrency factor = 2
• Still not enough results ?
• 3rd round: concurrency factor = 4
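The rounds above can be sketched in a few lines of Python. This is only a simulation under stated assumptions: the function `scatter_gather`, the `matches_on` callback and the sample data are hypothetical, and the concurrency factor is simply doubled each round to mirror the 1, 2, 4 progression on the slides. The real coordinator logic is more involved.

```python
def scatter_gather(nodes, matches_on, limit):
    """Query nodes in rounds until `limit` rows are collected.

    nodes      -- node names in the order the coordinator would contact them
    matches_on -- hypothetical callback: node name -> list of matching rows
    limit      -- number of rows the client asked for
    Returns (rows, rounds) where rounds lists the nodes queried per round.
    """
    rows, rounds = [], []
    concurrency = 1   # 1st round: a single node
    position = 0
    while position < len(nodes) and len(rows) < limit:
        batch = nodes[position:position + concurrency]
        for node in batch:            # sent in parallel by a real coordinator
            rows.extend(matches_on(node))
        rounds.append(batch)
        position += len(batch)
        concurrency *= 2              # assumption: double on each round
    return rows[:limit], rounds


ring = ["A", "B", "C", "D", "E", "F", "G", "H"]
data = {"A": ["r1"], "B": [], "C": ["r2"], "D": ["r3", "r4"], "E": ["r5"]}

rows, rounds = scatter_gather(ring, lambda n: data.get(n, []), limit=4)
print(rounds)
print(rows)
```

With a selective predicate the loop keeps widening until every node has been asked, which is what the caveats that follow are about.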
Caveat 1: query with non-restrictive filters
[Figure: 8-node ring (A to H) with a coordinator]
Hit all nodes ☹
Caveat 1 solution: always use LIMIT
SELECT * FROM …
WHERE ... LIMIT 1000
Caveat 2: 1-to-1 index (user_email)
WHERE user_email LIKE '%xxx%'
• Not found
• Still no result
• At best 1 user found, at worst 0 users found
Caveat 2 solution: use materialized views
For 1-to-1 index/relationship, use materialized views instead
CREATE MATERIALIZED VIEW user_by_email AS
SELECT * FROM users
WHERE user_id IS NOT NULL AND user_email IS NOT NULL
PRIMARY KEY (user_email, user_id);
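Why the view helps can be shown with a toy model in Python. Everything here is an assumption made for illustration: the `owner` function is a stand-in partitioner, the e-mail addresses are invented, and the two dictionaries merely imitate a base table partitioned by user_id and a view partitioned by user_email.

```python
import hashlib

NODES = ["A", "B", "C", "D", "E", "F", "G", "H"]


def owner(partition_key):
    """Toy partitioner (assumption): hash the key onto one of the nodes."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[digest[0] % len(NODES)]


# Base table partitioned by user_id, "view" partitioned by user_email.
users = {"u1": "alice@example.com", "u2": "bob@example.com"}
base_table = {n: {} for n in NODES}
view_by_email = {n: {} for n in NODES}
for user_id, email in users.items():
    base_table[owner(user_id)][user_id] = email
    view_by_email[owner(email)][email] = user_id


def find_with_local_index(email):
    """Local index: the row can be on any node, so all may be visited."""
    visited = 0
    for node in NODES:
        visited += 1
        for user_id, stored in base_table[node].items():
            if stored == email:
                return user_id, visited
    return None, visited


def find_with_view(email):
    """View keyed by email: the partitioner names the single node to ask."""
    return view_by_email[owner(email)].get(email), 1


print(find_with_local_index("nobody@example.com"))
print(find_with_view("bob@example.com"))
```

A lookup that misses costs a visit to every node with the local index, but only one node with the view.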
Caveat 3: fetch all rows for analytics use-case
Hit all nodes ☹
Caveat 3 solution: use co-located Apache Spark
• Local index filtering in Cassandra
• Aggregation in Spark
[Figure: 8-node ring (A to H); label: Local index query]
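The division of labour can be imitated in plain Python, which stands in for Spark here. The sample rows, the `local_filter` and `count_by_decade` helpers and the age column are hypothetical. The point is only that filtering and partial aggregation happen next to the data and that only small partial results are merged.

```python
from collections import Counter

# Hypothetical rows held by three nodes: (user_id, country, age).
partitions = {
    "A": [("user1", "FR", 31), ("user54", "US", 45)],
    "B": [("user87", "FR", 27), ("user17", "FR", 62)],
    "C": [("user483", "US", 19)],
}


def local_filter(rows, country):
    """Step done next to the data: keep only the rows the index would match."""
    return [row for row in rows if row[1] == country]


def count_by_decade(rows):
    """Partial aggregation computed by the worker sitting on the node."""
    return Counter((age // 10) * 10 for _, _, age in rows)


# Each node filters and aggregates its own rows, then partial results merge.
total = Counter()
for node, rows in partitions.items():
    total += count_by_decade(local_filter(rows, "FR"))
print(dict(total))
```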
Q & A
SASI Life-cycle
SASI Life-cycle: in-memory
[Figure: write path on a node]
1. Commit log (Commit log 1 … Commit log n)
2. In memory: MemTable (MemTable Table 1 … MemTable Table N)
3. In memory: Index MemTable (Index MemTable 1 … Index MemTable N)
ACK the client
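A toy model of this in-memory write path, written in Python. The `Replica` class and its fields are hypothetical names, and removing the stale index entry on overwrite is a simplification of this sketch rather than a statement about how Cassandra behaves.

```python
class Replica:
    """Toy model of the in-memory part of the write path (not Cassandra code)."""

    def __init__(self, indexed_column):
        self.indexed_column = indexed_column
        self.commit_log = []      # append-only, gives durability
        self.memtable = {}        # primary key -> row
        self.index_memtable = {}  # indexed value -> set of primary keys

    def write(self, key, row):
        self.commit_log.append((key, row))                     # step 1
        old = self.memtable.get(key)
        self.memtable[key] = row                               # step 2
        if old is not None:
            # Simplification of this sketch: drop the stale entry right away.
            self.index_memtable[old[self.indexed_column]].discard(key)
        value = row[self.indexed_column]
        self.index_memtable.setdefault(value, set()).add(key)  # step 3
        return "ACK"                                           # reply to client


replica = Replica("country")
replica.write("user1", {"country": "FR"})
replica.write("user2", {"country": "US"})
replica.write("user1", {"country": "UK"})   # overwrite moves the index entry
print(replica.index_memtable)
```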
IndexMemtable
| Index mode, data type | Data structure | Usage |
|---|---|---|
| PREFIX, text | ConcurrentRadixTree (concurrent-trees library) | name LIKE 'John%' |
| CONTAINS, text | ConcurrentSuffixTree (concurrent-trees library) | name LIKE '%John%', name LIKE '%ny' |
| PREFIX, other | JDK ConcurrentSkipListSet | age = 20, age >= 20 AND age <= 30 |
| SPARSE, other | JDK ConcurrentSkipListSet | age = 20, age >= 20 AND age <= 30 |
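What these structures buy can be shown with sorted lists and `bisect` in Python. This is a stand-in, not the trees and skip lists named in the table: the function names and sample data are assumptions, and the suffix index is built naively by enumerating every suffix.

```python
import bisect


def prefix_search(sorted_terms, prefix):
    """PREFIX mode: terms starting with `prefix` form one contiguous range."""
    start = bisect.bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]


def build_suffix_index(terms):
    """CONTAINS mode: index every suffix so a substring becomes a prefix."""
    return sorted((term[i:], term) for term in terms for i in range(len(term)))


def contains_search(suffix_index, needle):
    """Find terms containing `needle` via a prefix scan over the suffixes."""
    keys = [suffix for suffix, _ in suffix_index]
    start = bisect.bisect_left(keys, needle)
    found = set()
    for suffix, term in suffix_index[start:]:
        if not suffix.startswith(needle):
            break
        found.add(term)
    return sorted(found)


def range_search(sorted_values, low, high):
    """Numeric index: an inclusive range is a slice of the ordered set."""
    return sorted_values[bisect.bisect_left(sorted_values, low):
                         bisect.bisect_right(sorted_values, high)]


names = sorted(["John", "Johnny", "Jonathan", "Danny"])
suffixes = build_suffix_index(names)
print(prefix_search(names, "John"))
print(contains_search(suffixes, "ny"))
print(range_search([18, 20, 25, 30, 41], 20, 30))
```

The suffix index is much larger than the list of terms, which hints at why CONTAINS mode costs more space than PREFIX mode.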
SASI Life-cycle: flush to SSTable
[Figure: commit logs 1 … n on disk; Table1, Table2, Table3 in memory]
4. Each in-memory table is flushed to its own SSTable (SSTable1, SSTable2, SSTable3), each with its own OnDiskIndex (OnDiskIndex1, OnDiskIndex2, OnDiskIndex3)
SASI Life-cycle: compaction
SSTable1 + SSTable2 + SSTable3 → New SSTable
OnDiskIndex1 + OnDiskIndex2 + OnDiskIndex3 → New OnDiskIndex
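A rough Python sketch of what compaction means for the index: the data files are merged and the index is rebuilt for the merged file. The helpers `build_index` and `compact` and the sample rows are hypothetical. Real compaction resolves conflicts with write timestamps and tombstones, whereas this sketch lets list order stand in for write time.

```python
def build_index(sstable):
    """Derive the index of one SSTable: indexed value -> sorted row keys."""
    index = {}
    for key, value in sstable.items():
        index.setdefault(value, []).append(key)
    return {value: sorted(keys) for value, keys in index.items()}


def compact(sstables):
    """Merge SSTables (oldest first, newest wins) and rebuild the index."""
    merged = {}
    for sstable in sstables:
        merged.update(sstable)
    return merged, build_index(merged)


sstable1 = {"user_id1": "US", "user_id4": "FR"}
sstable2 = {"user_id2": "DE", "user_id4": "UK"}   # newer value for user_id4
new_sstable, new_index = compact([sstable1, sstable2])
print(new_index)
```

The overwritten value FR no longer appears in the new index, so stale entries disappear together with the stale data.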
OnDiskIndex Files
SSTable1: user_id4 FR, user_id1 US, user_id5 FR
OnDiskIndex1: FR, US
SSTable2: user_id3 UK, user_id2 DE
OnDiskIndex2: UK, DE
Suffix Tree Data structures
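The one-index-file-per-SSTable layout can be mimicked with the data from the slide. In this Python sketch the dictionary layout and the `on_disk_index` and `search` helpers are assumptions. It only illustrates that a query consults every index file and then reads the matching rows from the SSTable that file belongs to.

```python
# Data taken from the slide: each SSTable has its own index file.
sstable1 = {"user_id4": "FR", "user_id1": "US", "user_id5": "FR"}
sstable2 = {"user_id3": "UK", "user_id2": "DE"}


def on_disk_index(sstable):
    """Hypothetical index content: country -> row keys, terms kept sorted."""
    index = {}
    for key, country in sstable.items():
        index.setdefault(country, []).append(key)
    return dict(sorted(index.items()))


indexes = [(on_disk_index(s), s) for s in (sstable1, sstable2)]


def search(country):
    """Ask every index file, then read the matching rows from its SSTable."""
    rows = {}
    for index, sstable in indexes:
        for key in index.get(country, []):
            rows[key] = sstable[key]
    return rows


print(list(indexes[0][0]))   # terms of OnDiskIndex1
print(search("FR"))
```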
Q & A
Query Planner
Integrated query planner
Perform optimizations on predicates
1. build predicates tree
2. predicates push-down & re-ordering
3. predicate fusions for != operator
Query optimization example
WHERE age < 100 AND first_name = 'p*' AND first_name != 'pa*' AND age > 21
1. AND is associative and commutative
2. != transformed to exclusion on range scan
3. AND is associative and commutative
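The effect of these rewrites can be sketched in Python: predicates are grouped per column, range bounds are merged, and != is folded in as an exclusion. The `plan` function and its dictionary layout are hypothetical and much simpler than the real planner. The sketch also treats both name predicates as applying to a single column `first_name`, which is an assumption.

```python
from collections import defaultdict


def plan(predicates):
    """Group (column, op, value) predicates into one scan per column."""
    scans = defaultdict(lambda: {"lower": None, "upper": None,
                                 "equals": None, "exclude": []})
    for column, op, value in predicates:   # order is irrelevant under AND
        scan = scans[column]
        if op == ">":
            scan["lower"] = value if scan["lower"] is None else max(scan["lower"], value)
        elif op == "<":
            scan["upper"] = value if scan["upper"] is None else min(scan["upper"], value)
        elif op == "=":
            scan["equals"] = value
        elif op == "!=":
            scan["exclude"].append(value)   # folded into the scan as an exclusion
        else:
            raise ValueError("unsupported operator: " + op)
    return dict(scans)


where = [("age", "<", 100), ("first_name", "=", "p*"),
         ("first_name", "!=", "pa*"), ("age", ">", 21)]
query_plan = plan(where)
for column, scan in query_plan.items():
    print(column, scan)
```

Four predicates end up as two scans: one range on age and one prefix match on first_name with an exclusion attached.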
Q & A
Some Benchmarks
Hardware specs
13 bare-metal machines
• 6 CPU HT (12 vcores)
• 64 GB RAM
• 4 SSDs in RAID0 for a total of 1.5 TB
Data set
• 13 billion rows
• 1 numerical index with 36 distinct values
• 2 text indexes with 7 distinct values
• 1 text index with 3 distinct values
Benchmark results
Full scan using server-side paging
| Predicate count | Fetched rows | Query time in sec |
|---|---|---|
| 1 | 36 109 986 | 609 |
| 2 | 2 781 492 | 330 |
| 3 | 1 044 547 | 372 |
| 4 | 360 334 | 116 |
Take Away
Conclusion
Is it available ?
• yes in Cassandra 3.5
Future enhancement ?
• index on collections (List, Set & Map) !
• OR clause (WHERE (xxx OR yyy) AND zzz)
• != operator
SASI vs Solr/ElasticSearch ?
• Cassandra is not a search engine !!! (database = durability)
• always slower because 2 passes (SASI index read + original Cassandra data)
• no scoring
• no ordering (ORDER BY)
• no grouping (GROUP BY) → Apache Spark for analytics
Still, SASI covers 80% of search use-cases and people are happy !