ScyllaDB: Achieving No-Compromise Performance
Avi Kivity, CTO
@AviKivity (Hiring!)
Agenda
● Background
● Goals
● Methods
● Conclusion
Non-Agenda
● Docker
● Microservices
● Node.js
● Docker
● Orchestration
● JVM GC Tuning
● JSON over HTTP
● Docker
More Non-Agenda
● Cache lines, coherency protocols
● NUMA
● Algorithms are the only thing that matters, everything else is implementation detail
● Docker
Background - ScyllaDB
● Clustered NoSQL database compatible with Apache Cassandra
● ~10X performance on same hardware
● Low latency, esp. higher percentiles
● Self tuning
● C++14, fully asynchronous; Seastar!
YCSB Benchmark: 3-node Scylla cluster vs 3, 9, 15, 30 Cassandra machines
[Figure: benchmark charts; series shown: 3 Scylla, 30 Cassandra, 3 Cassandra]
Log-Structured Merge Tree
[Figure: SSTable 1 through SSTable 5 laid out along a Time axis, with a merged SSTable 1+2+3; labels: Foreground Job, Background Job]
High-level Goals
● Efficiency:
  ○ Make the most out of every cycle
● Utilization:
  ○ Squeeze every cycle from the machine
● Control:
  ○ Spend the cycles on what we want, when we want
Characterizing the problem
● Large numbers of small operations
  ○ Make coordination cheap
● Lots of communications
  ○ Within the machine
  ○ With disk
  ○ With other machines
Asynchrony, Everywhere
General Architecture
● Thread-per-core design
  ○ Never block
● Asynchronous networking
● Asynchronous file I/O
● Asynchronous multicore
Scylla has its own task scheduler
[Figure: traditional stack (threads with stacks on each CPU) next to Scylla's stack (a scheduler on each CPU running queues of promises and tasks)]
Traditional stack
● Thread is a function pointer
● Stack is a byte array from 64k to megabytes
● Context switch cost is high. Large stacks pollute the caches
Scylla's stack
● Promise is a pointer to an eventually computed value
● Task is a pointer to a lambda function
● No sharing, millions of parallel events
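The promise/task idea on this slide can be illustrated with a toy run queue. This is a minimal sketch in Python, not Seastar's API: the `Promise` and `Scheduler` classes and their method names are hypothetical and only show how a continuation becomes a small queued task instead of a blocked thread.

```python
from collections import deque


class Scheduler:
    """One per core in this sketch: a plain queue of small tasks, no thread stacks."""

    def __init__(self):
        self._queue = deque()

    def schedule(self, task):
        self._queue.append(task)

    def run(self):
        # Run tasks to completion, one after another; a task never blocks.
        while self._queue:
            self._queue.popleft()()


class Promise:
    """Placeholder for a value that will be computed later (hypothetical API)."""

    def __init__(self, scheduler):
        self._scheduler = scheduler
        self._callbacks = []
        self._resolved = False
        self._value = None

    def then(self, fn):
        """Attach a continuation; returns a promise for fn's result."""
        result = Promise(self._scheduler)

        def task(value):
            result.set_value(fn(value))

        if self._resolved:
            self._scheduler.schedule(lambda: task(self._value))
        else:
            self._callbacks.append(task)
        return result

    def set_value(self, value):
        """Resolve the promise and queue every waiting continuation as a task."""
        self._resolved = True
        self._value = value
        for task in self._callbacks:
            self._scheduler.schedule(lambda t=task: t(value))
        self._callbacks.clear()


scheduler = Scheduler()
read = Promise(scheduler)
results = []
read.then(lambda row: row.upper()).then(results.append)
read.set_value("row-42")  # stands in for an I/O completion
scheduler.run()
print(results)
```

Each pending operation costs one small object and one closure here, which is the contrast the slide draws with a per-thread stack of 64k or more.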
The Concurrency Dilemma
Fundamental performance equation
Concurrency = Throughput * Latency
Throughput = Concurrency / Latency
Latency = Concurrency / Throughput
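A short worked example of the three forms of the equation. The function names and the numbers (100,000 requests per second, 1 ms latency, 400 requests in flight) are illustrative assumptions, not figures from the talk.

```python
def required_concurrency(throughput_per_s, latency_s):
    """Concurrency = Throughput * Latency."""
    return throughput_per_s * latency_s


def achievable_throughput(concurrency, latency_s):
    """Throughput = Concurrency / Latency."""
    return concurrency / latency_s


def resulting_latency(concurrency, throughput_per_s):
    """Latency = Concurrency / Throughput."""
    return concurrency / throughput_per_s


# Sustaining 100,000 requests/s at 1 ms each needs about 100 requests in flight.
print(required_concurrency(100_000, 0.001))

# If clients keep 400 requests in flight against a system that tops out at
# 100,000 requests/s, each request takes about 4 ms end to end.
print(resulting_latency(400, 100_000))
```

The second case is the dilemma in miniature: once throughput is capped, extra concurrency only buys extra latency.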
Lower bounds for concurrency
● Disks want minimum iodepth for full throughput (heads/chips)
● Remote nodes need concurrency to hide network latency and their own min. concurrency
● Compute wants work for each core
Results of Mathematical Analysis
● Want high concurrency (for throughput)
● Want low concurrency (for latency)
● Resources require concurrency for full utilization
Sources of concurrency
● Users
  ○ Reduce concurrency / add nodes
● Internal processes
  ○ Generate as much concurrency as possible
  ○ Schedule
Resource Scheduling
[Figure: a Scheduler sits between four request classes (User read, User write, Compaction (internal), Streaming (internal)) and Storage; values shown in the figure: 8, 30, 12, 50, 50]
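One common way to build such a scheduler is proportional sharing: each class gets a number of shares, and the next request comes from the backlogged class that has consumed the least relative to its shares. The sketch below assumes that approach; the class name `ShareScheduler`, the share values, and the simple virtual-time rule are illustrative and are not taken from Scylla's implementation or from the numbers in the figure.

```python
from collections import deque
from fractions import Fraction


class ShareScheduler:
    """Dispatch queued requests in proportion to per-class shares (sketch)."""

    def __init__(self, shares):
        self._shares = dict(shares)
        self._queues = {name: deque() for name in shares}
        # Accumulated cost per class, scaled by 1/shares ("virtual time").
        self._vtime = {name: Fraction(0) for name in shares}

    def submit(self, cls, request):
        self._queues[cls].append(request)

    def dispatch(self):
        """Return (class, request) for the next request, or None if idle."""
        ready = [name for name, queue in self._queues.items() if queue]
        if not ready:
            return None
        cls = min(ready, key=lambda name: self._vtime[name])
        self._vtime[cls] += Fraction(1, self._shares[cls])
        return cls, self._queues[cls].popleft()


# Hypothetical shares: user reads get three times the weight of compaction.
sched = ShareScheduler({"user_read": 3, "compaction": 1})
for i in range(100):
    sched.submit("user_read", f"read-{i}")
    sched.submit("compaction", f"compact-{i}")

counts = {"user_read": 0, "compaction": 0}
for _ in range(40):
    cls, _request = sched.dispatch()
    counts[cls] += 1
print(counts)
```

With both queues backlogged, roughly three reads are dispatched for each compaction request. A production scheduler would also cap how many requests are in flight at the disk and would stop an idle class from hoarding credit, both of which this sketch omits.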
Why not the Linux I/O scheduler?
● Can only communicate priority by originating thread
● Will reorder/merge like crazy
● Disable
Figuring out optimal disk concurrency
[Figure: annotated with "Max useful disk concurrency"]
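A simple way to locate the "max useful disk concurrency" is to measure throughput at increasing queue depths and take the smallest depth that already delivers almost all of the peak. The sketch below assumes that method; the sample measurements and the 5% tolerance are made up for illustration and do not come from the talk.

```python
def max_useful_concurrency(samples, tolerance=0.05):
    """Smallest concurrency whose throughput is within `tolerance` of the peak.

    `samples` is a list of (concurrency, throughput) measurements.
    """
    peak = max(throughput for _, throughput in samples)
    for concurrency, throughput in sorted(samples):
        if throughput >= (1.0 - tolerance) * peak:
            return concurrency
    raise ValueError("no samples given")


# Hypothetical measurements: (requests in flight, IOPS).
measurements = [
    (1, 10_000), (2, 19_000), (4, 36_000), (8, 62_000),
    (16, 90_000), (32, 98_000), (64, 100_000), (128, 100_500),
]
print(max_useful_concurrency(measurements))
```

Beyond that depth, by the performance equation above, extra in-flight requests mostly add latency rather than throughput.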
Cache design
Cache files or objects?
Using the kernel page cache
● 4k granularity
● Thread-safe
● Synchronous APIs
● General-purpose
● Lack of control (1)
● Lack of control (2)

● Exists
● Hundreds of hacker-years
● Handling lots of edge cases
Unified cache
Cassandra: Key cache, Row cache, On-heap / Off-heap, Linux page cache, SSTables
Scylla: Unified cache, SSTables
Callouts: Tuning, Parasitic rows, Page faults
[Figure: page-fault path across App thread, Kernel and SSD]
● Page fault, suspend thread
● Initiate I/O, context switch
● I/O completes, interrupt, context switch
● Map page, resume thread
[Figure: an SSTable page (4k) holding your data (300b)]
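The "parasitic rows" cost can be put in numbers using the two sizes on the slide, read here as 4096 bytes per page and 300 bytes per row. The sketch assumes the worst case where each hot row sits on a different page, and it ignores per-entry bookkeeping in a row-granular cache; the 1 GiB cache size is an arbitrary example.

```python
PAGE_BYTES = 4096  # "SSTable page (4k)" on the slide, read as 4096 bytes
ROW_BYTES = 300    # "Your data (300b)" on the slide, read as 300 bytes


def useful_fraction(row_bytes=ROW_BYTES, page_bytes=PAGE_BYTES):
    """Share of a cached page occupied by the row that was actually wanted."""
    return row_bytes / page_bytes


def hot_rows_in_page_cache(cache_bytes, page_bytes=PAGE_BYTES):
    """Worst case: every hot row lives on its own page, so one page per row."""
    return cache_bytes // page_bytes


def hot_rows_in_object_cache(cache_bytes, row_bytes=ROW_BYTES):
    """Row-granular cache, ignoring per-entry bookkeeping overhead."""
    return cache_bytes // row_bytes


one_gib = 1 << 30
print(f"useful bytes per cached page: {useful_fraction():.1%}")
print(hot_rows_in_page_cache(one_gib), hot_rows_in_object_cache(one_gib))
```

Under these assumptions the same memory holds more than ten times as many hot rows when the cache works on rows instead of pages.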
Workload Conditioning
● Internal feedback loops to balance competing loads
[Figure: Memtable, Compaction, Query, Repair and Commitlog around the Seastar Scheduler, with SSD, WAN and CPU as resources; a Compaction Backlog Monitor and a Memory Monitor each adjust priority]
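A feedback loop of this kind can be sketched as a proportional controller: a monitor reads the backlog and raises the priority of the job that drains it. The function names, share limits, and rates below are hypothetical; the talk does not give the actual control law.

```python
MIB = 1 << 20


def compaction_shares(backlog_bytes, min_shares=50, max_shares=1000,
                      bytes_per_share=MIB):
    """Illustrative controller: one extra share per MiB of backlog, clamped."""
    shares = min_shares + backlog_bytes // bytes_per_share
    return max(min_shares, min(max_shares, shares))


def simulate(ticks, ingest_per_tick, drain_per_share):
    """Each tick: writes add backlog, compaction drains it according to its shares."""
    backlog = 0
    history = []
    for _ in range(ticks):
        shares = compaction_shares(backlog)
        backlog = max(0, backlog + ingest_per_tick - shares * drain_per_share)
        history.append((shares, backlog))
    return history


# Hypothetical load: 100 MiB of new data per tick, 1 MiB drained per share.
history = simulate(ticks=20, ingest_per_tick=100 * MIB, drain_per_share=MIB)
shares, backlog = history[-1]
print(shares, backlog // MIB)
```

In this toy model the loop settles where compaction's share is just large enough to match the ingest rate, so the backlog stops growing without any operator tuning.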
Replacing the system memory allocator
System memory allocator problems
● Thread safe
● Allocation back pressure
Seastar memory allocator
● Non-thread-safe!
  ○ Each core gets a private memory pool
● Allocation back pressure
  ○ Allocator calls a callback when low on memory
  ○ Scylla evicts cache in response
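The back-pressure bullets can be sketched as a per-core pool that invokes a reclaim callback before failing an allocation. `CorePool`, `Cache`, and their methods are hypothetical names for illustration and do not mirror Seastar's real interfaces.

```python
from collections import OrderedDict


class CorePool:
    """Private pool for one core: no locks, because only one thread uses it."""

    def __init__(self, capacity, reclaim):
        self.capacity = capacity
        self.used = 0
        self._reclaim = reclaim  # callback(bytes_needed), expected to free memory

    def allocate(self, size):
        if self.used + size > self.capacity:
            # Low on memory: ask the owner of reclaimable memory to give some back.
            self._reclaim(self.used + size - self.capacity)
        if self.used + size > self.capacity:
            raise MemoryError("pool exhausted even after reclaim")
        self.used += size

    def release(self, size):
        self.used -= size


class Cache:
    """Evicts its oldest entries when the allocator asks for memory."""

    def __init__(self):
        self.pool = None
        self.entries = OrderedDict()  # key -> size, oldest first

    def put(self, key, size):
        self.pool.allocate(size)
        self.entries[key] = size

    def evict(self, needed):
        freed = 0
        while self.entries and freed < needed:
            _, size = self.entries.popitem(last=False)
            self.pool.release(size)
            freed += size
        return freed


cache = Cache()
pool = CorePool(capacity=1000, reclaim=cache.evict)
cache.pool = pool
for i in range(4):
    cache.put(f"row{i}", 300)
print(list(cache.entries), pool.used)
```

The fourth insert does not fit, so the pool calls back into the cache, the oldest row is evicted, and the allocation then succeeds.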
One allocator is not enough
Remaining problems with malloc/free
● Memory gets fragmented over time
  ○ If workload changes sizes of allocated objects
● Allocating a large contiguous block requires evicting most of cache
[Figure: Memory, ending in OOM :(]
Log-structured memory allocation
● The cache
  ○ Large majority of memory allocated
  ○ Small subset of allocation sites
● Teach allocator how to move allocated objects around
  ○ Updating references
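A toy model of an allocator that can move objects: allocations are appended to fixed-size segments, each object is reached through one owning handle, and compaction copies live objects into fresh segments while retargeting their handles. All names and the segment size are hypothetical, and the sketch assumes no object is larger than a segment.

```python
class Handle:
    """The single owning reference to an object; the allocator may retarget it."""

    def __init__(self):
        self.segment = None
        self.offset = None


class Segment:
    def __init__(self, size):
        self.size = size
        self.used = 0    # bump pointer: objects are only ever appended
        self.live = {}   # offset -> (handle, payload)


class LogAllocator:
    def __init__(self, segment_size):
        self.segment_size = segment_size
        self.segments = [Segment(segment_size)]

    def _append(self, handle, payload):
        segment = self.segments[-1]
        if segment.used + len(payload) > segment.size:
            segment = Segment(self.segment_size)
            self.segments.append(segment)
        segment.live[segment.used] = (handle, payload)
        handle.segment, handle.offset = segment, segment.used
        segment.used += len(payload)

    def alloc(self, payload):
        handle = Handle()
        self._append(handle, payload)
        return handle

    def free(self, handle):
        del handle.segment.live[handle.offset]

    def read(self, handle):
        return handle.segment.live[handle.offset][1]

    def compact(self):
        """Copy live objects into fresh segments and update their references."""
        old_segments = self.segments
        self.segments = [Segment(self.segment_size)]
        for segment in old_segments:
            for handle, payload in segment.live.values():
                self._append(handle, payload)


allocator = LogAllocator(segment_size=8)
handles = [allocator.alloc(bytes([65 + i]) * 4) for i in range(6)]
for handle in handles[1::2]:
    allocator.free(handle)       # leaves every segment half empty
before = len(allocator.segments)
allocator.compact()
after = len(allocator.segments)
print(before, after, allocator.read(handles[0]))
```

After compaction the surviving objects are packed together, whole segments are free again for large contiguous allocations, and callers still reach their data through the same handles.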
[Animation: log-structured memory allocation]
Future Improvements
Userspace TCP/IP stack
● Thread-per-core design
● Use DPDK to drive hardware
● Present as experimental mode
  ○ Needs more testing and productization
Query Compilation to Native Code
● Use LLVM to JIT-compile CQL queries
● Embed database schema and internal object layouts into the query
Conclusions
● Full control of the software stack can generate big payoffs
● Careful system design can maximize throughput
● Without sacrificing latency
● Without requiring endless end-user tuning
● While having a lot of fun
How to interact
● Download: http://www.scylladb.com
● Twitter: @ScyllaDB
● Source: http://github.com/scylladb/scylla
● Mailing lists: scylladb-user @ groups.google.com
● Company site & blog: http://www.scylladb.com
THE SCYLLA IS THE LIMIT
Thank you.