Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Transcript of Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Nadav Har'El, ScyllaDB
The Generalist Engineer meetup, Tel-Aviv
Ides of March, 2016
Seastar
Or how we implemented a 10-times faster Cassandra
2
● Israeli but multi-national startup company
– 15 developers cherry-picked from 10 countries.
● Founded 2013 (“Cloudius Systems”)
– by Avi Kivity and Dor Laor of KVM fame.
● Fans of open-source: OSv, Seastar, ScyllaDB.
3
Your mission, should you choose to accept it:
Make Cassandra 10 times faster
4
“Make Cassandra 10 times faster”
● Why 10?
● Why Cassandra?
– Popular NoSQL database (2nd to MongoDB).
– Powerful and widely applicable.
– Example of a wider class of middleware.
● Why “mission impossible”?
– Cassandra is not considered particularly slow
– Considered faster than MongoDB, HBase, et al.
– “disk is bottleneck” (no longer, with SSD!)
5
Our first attempt: OSv
● New OS design specifically for cloud VMs:
– Run a single application per VM (“unikernel”)
– Run existing Linux applications (Cassandra)
– Run these faster than Linux.
6
OSv
● Some of the many ideas we used in OSv:
– Single address space.
– System call is just a function call.
– Faster context switches.
– No spin locks.
– Smaller code.
– Redesigned network stack (Van Jacobson).
7
OSv
● Writing an entire OS from scratch was a really fun exercise for our generalist engineers.
● Full description of OSv is beyond the scope of this talk. Check out:
– “OSv—Optimizing the Operating System for Virtual Machines”, USENIX ATC 2014.
8
Cassandra on OSv
● cassandra-stress, READ, 4 vCPUs:
– On OSv, 34% faster than on Linux
● Very nice, but not even close to our goal.
What are the remaining bottlenecks?
9
Bottlenecks: API locks
● In one profile, we saw 20% of the run time spent in lock() and unlock() operations, most of them uncontended
– POSIX APIs allow threads to share:
  ● file descriptors
  ● sockets
– As many as 20 lock/unlock operations for each network packet!
● Uncontended locks were efficient on uniprocessor (UP) systems (just a flag to disable preemption), but atomic operations are slow on many cores.
10
Bottlenecks: API copies
● Write/send system calls copy user data to the kernel
– Even on OSv, with no user-kernel separation
– Part of the socket API
● Similar for read
11
Bottlenecks: context switching
● One thread per CPU is optimal; more than one incurs:
– Context switch time
– Stacks that consume memory and pollute the CPU cache
– Thread imbalance
● Requires fully non-blocking APIs
– Cassandra uses mmap() for disk…
12
Bottlenecks: unscalable applications
● Contended locks ruin scalability to many cores
– memcached's counter and shared cache
● Solution: per-CPU data.
● Even lock-free atomic algorithms are unscalable
– Cache line bouncing
– Becomes worse as core count grows
  ● NUMA
● Again, better to shard, not share, data.
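To make the "shard, not share" point concrete, here is a minimal sketch in standard C++ (it is not Seastar code, and it is not taken from the talk). It counts events two ways: with one atomic shared by all threads, and with one counter per thread that is summed at the end. The names `count_shared`, `count_sharded` and `PaddedCounter` are hypothetical, the 64-byte cache-line size is an assumption, and the over-aligned `std::vector` element relies on C++17.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per thread, padded to its own cache line so that
// increments by different threads never touch the same line.
// 64 bytes is an assumed cache-line size (typical for x86).
struct alignas(64) PaddedCounter {
    std::uint64_t value = 0;
};

// Shared design: every thread increments the same atomic, so the cache
// line holding it bounces between cores on every increment.
std::uint64_t count_shared(unsigned nthreads, std::uint64_t per_thread) {
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&total, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                total.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : threads) th.join();
    return total.load();
}

// Sharded design: each thread owns one counter; the shards are only
// summed once, after all threads have finished.
std::uint64_t count_sharded(unsigned nthreads, std::uint64_t per_thread) {
    std::vector<PaddedCounter> shards(nthreads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t) {
        threads.emplace_back([&shards, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i) {
                ++shards[t].value;
            }
        });
    }
    for (auto& th : threads) th.join();
    std::uint64_t sum = 0;
    for (const auto& s : shards) sum += s.value;
    return sum;
}
```

Both functions return the same total. The difference is only in how much cross-core traffic each increment causes, so timing them on a many-core machine is the interesting experiment.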
13
Therefore
● Need to provide better APIs for server applications
– Not file descriptors, sockets, threads, etc.
● Need to write better applications.
14
Framework
● One thread per CPU
– Event-driven programming
– Everything (network & disk) is non-blocking
– How to write complex applications?
15
Framework
● Sharded (shared-nothing) applications
– Important!
16
Framework
● Language with no runtime overheads or built-in data sharing
17
Seastar
● C++14 library
● For writing new high-performance server applications
● Share-nothing model, fully asynchronous
● Futures & continuations based
– Unified API for all asynchronous operations
– Compose complex asynchronous operations
– The key to complex applications
● (Optionally) full zero-copy user-space TCP/IP (over DPDK)
● Open source: http://www.seastar-project.org/
18
Seastar linear scaling in #cores
19
Seastar linear scaling in #cores
20
Brief introduction to Seastar
21
Sharded application design
● One thread per CPU
● Each thread handles one shard of data
– No shared data (“share nothing”)
– Separate memory per CPU (NUMA aware)
– Message-passing between CPUs
– No locks or cache line bounces
● Reactor (event loop) per thread
● User-space network stack also sharded
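The design above can be modeled in a few lines of standard C++. This is a single-threaded toy, not Seastar's API. Every key is owned by exactly one shard, and other code reaches that shard's data only by sending it a message. All names (`Shard`, `ShardedStore`, `run_pending`) are hypothetical, and hashing the key modulo the shard count is just one assumed way to assign keys to shards.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Each shard owns a private map and an inbox of messages; no shard ever
// touches another shard's map directly.
struct Shard {
    std::unordered_map<std::string, int> data;      // private to this shard
    std::deque<std::function<void(Shard&)>> inbox;  // messages sent to this shard
};

class ShardedStore {
public:
    explicit ShardedStore(std::size_t nshards) : shards_(nshards) {
        assert(nshards > 0);
    }

    // Every key belongs to exactly one shard, chosen by its hash.
    std::size_t shard_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % shards_.size();
    }

    // "Message passing": enqueue the operation on the owning shard
    // instead of locking a shared structure.
    void put(const std::string& key, int value) {
        shards_[shard_of(key)].inbox.push_back(
            [key, value](Shard& s) { s.data[key] = value; });
    }

    // Stand-in for the per-thread event loops: each shard drains its own
    // inbox. In a real system each shard would do this on its own CPU.
    void run_pending() {
        for (auto& s : shards_) {
            while (!s.inbox.empty()) {
                auto msg = std::move(s.inbox.front());
                s.inbox.pop_front();
                msg(s);
            }
        }
    }

    // Read a key from its owning shard; returns fallback if absent.
    int get(const std::string& key, int fallback) const {
        const auto& data = shards_[shard_of(key)].data;
        auto it = data.find(key);
        return it == data.end() ? fallback : it->second;
    }

    std::size_t shard_size(std::size_t i) const { return shards_[i].data.size(); }

private:
    std::vector<Shard> shards_;
};
```

Because a shard's map is only ever modified by messages that the shard itself runs, no lock is needed around it. That is the property the slide summarizes as "no locks or cache line bounces".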
22
Futures and continuations
● Futures and continuations are the building blocks of asynchronous programming in Seastar.
● They can be composed into a large, complex asynchronous program.
23
Futures and continuations
● A future is a result which may not be available yet:
– Data buffer from the network
– Timer expiration
– Completion of a disk write
– The result of a computation which requires the values from one or more other futures.
● future<int>
● future<>
24
Futures and continuations
● An asynchronous function (also “promise”) is a function returning a future:
– future<> sleep(duration)
– future<temporary_buffer<char>> read()
● The function sets up for the future to be fulfilled
– sleep() sets a timer to fulfill the future it returns
25
Futures and continuations
● A continuation is a callback, typically a lambda, executed when a future becomes ready
– sleep(1s).then([] { std::cerr << "done"; });
● A continuation can hold state (lambda capture)
– future<int> slow_incr(int i) {
      return sleep(10ms).then([i] { return i + 1; });
  }
26
Futures and continuations
● Continuations can be nested:
– future<int> get();
  future<> put(int);
  get().then([] (int value) {
      return put(value + 1).then([] {
          std::cout << "done";
      });
  });
● Or chained:
– get().then([] (int value) {
      return put(value + 1);
  }).then([] {
      std::cout << "done";
  });
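To show what happens underneath `.then()`, here is a deliberately tiny future/promise pair in standard C++. It is a teaching sketch under stated assumptions, not Seastar's implementation. It is single-threaded, allows one continuation per future, has no error handling, and the continuation must return a plain non-void value (Seastar also lets a continuation return another future, which this toy does not support). `State`, `Future` and `Promise` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Shared state between a promise (producer) and its future (consumer).
template <typename T>
struct State {
    bool ready = false;
    T value{};
    std::function<void(T)> continuation;  // at most one, run when ready
};

template <typename T>
class Future {
public:
    explicit Future(std::shared_ptr<State<T>> s) : state_(std::move(s)) {}

    // Attach a continuation. It runs immediately if the value is already
    // there, otherwise later, when the promise is fulfilled. The result
    // of the callback becomes a new future, so calls can be chained.
    template <typename Func>
    auto then(Func func) -> Future<decltype(func(std::declval<T>()))> {
        using R = decltype(func(std::declval<T>()));
        assert(!state_->continuation);  // toy limitation: one continuation
        auto next = std::make_shared<State<R>>();
        auto run = [func, next](T v) {
            next->value = func(std::move(v));
            next->ready = true;
            if (next->continuation) next->continuation(next->value);
        };
        if (state_->ready) {
            run(state_->value);
        } else {
            state_->continuation = run;
        }
        return Future<R>(next);
    }

    bool ready() const { return state_->ready; }
    const T& get() const { return state_->value; }

private:
    std::shared_ptr<State<T>> state_;
};

template <typename T>
class Promise {
public:
    Promise() : state_(std::make_shared<State<T>>()) {}
    Future<T> get_future() { return Future<T>(state_); }

    // Fulfill the future and run the continuation chain, if any.
    void set_value(T v) {
        state_->value = std::move(v);
        state_->ready = true;
        if (state_->continuation) state_->continuation(state_->value);
    }

private:
    std::shared_ptr<State<T>> state_;
};
```

With this toy, `promise.get_future().then(f).then(g)` builds the chain up front and nothing runs until `promise.set_value(x)` is called. At that point `f` and then `g` run in order, which mirrors the chained form shown on the slide.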
27
Futures and continuations
● Parallelism is easy:
– sleep(100ms).then([] { std::cout << "100ms\n"; });
  sleep(200ms).then([] { std::cout << "200ms\n"; });
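The reason the two sleeps overlap is that neither call blocks: each one only registers a timer and returns. The sketch below imitates that with a toy event loop and a virtual clock in standard C++. `ToyReactor` and `sleep_then` are hypothetical names, not Seastar's reactor or its `sleep()`.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Toy single-threaded event loop with a virtual clock. Timers never
// block: they are only registered, and run() fires them in deadline order.
class ToyReactor {
public:
    using Callback = std::function<void()>;

    // Register a callback to run delay_ms after the current virtual time.
    // Returns immediately, like an asynchronous sleep(...).then(...).
    void sleep_then(long delay_ms, Callback cb) {
        assert(delay_ms >= 0);
        timers_.push(Timer{now_ms_ + delay_ms, next_id_++, std::move(cb)});
    }

    // Fire all timers in deadline order, jumping the virtual clock forward
    // to each deadline instead of actually waiting.
    void run() {
        while (!timers_.empty()) {
            Timer t = timers_.top();
            timers_.pop();
            now_ms_ = t.deadline_ms;
            t.cb();
        }
    }

    long now_ms() const { return now_ms_; }

private:
    struct Timer {
        long deadline_ms;
        unsigned long id;  // tie-breaker: earlier registration fires first
        Callback cb;
        bool operator>(const Timer& o) const {
            return std::make_pair(deadline_ms, id) >
                   std::make_pair(o.deadline_ms, o.id);
        }
    };

    // Min-heap on (deadline, id), so top() is the next timer to fire.
    std::priority_queue<Timer, std::vector<Timer>, std::greater<Timer>> timers_;
    long now_ms_ = 0;
    unsigned long next_id_ = 0;
};
```

Registering a 200 ms timer and a 100 ms timer and then calling `run()` fires the 100 ms callback first and ends with the virtual clock at 200 ms, not 300 ms. The waits overlap instead of adding up.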
28
Futures and continuations
● In Seastar, every asynchronous operation is a future:
– Network read or write
– Disk read or write
– Timers
– …
– A complex combination of other futures
● Useful for everything from writing a network stack to writing a full, complex application.
29
Network zero-copy
● future<temporary_buffer> input_stream::read()
– temporary_buffer points at driver-provided pages, if possible.
– Automatically discarded after use (C++).
● future<> output_stream::write(temporary_buffer)
– Future becomes ready when TCP window allows further writes (usually immediately).
– Buffer discarded after data is ACKed.
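The two properties listed above, handing data around without copying it and releasing it automatically once nobody needs it, can be illustrated with a small reference-counted buffer in standard C++. This is a hypothetical `ToyBuffer`, written only to show the idea. It is not Seastar's `temporary_buffer` and makes no claim about how that class is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// The bytes are reference counted; views are handed around without
// copying, and the storage is freed when the last view goes away.
class ToyBuffer {
public:
    explicit ToyBuffer(std::size_t size)
        : storage_(new char[size], std::default_delete<char[]>()),
          data_(storage_.get()),
          size_(size) {}

    // A view of [pos, pos + len) that shares the same storage: no bytes
    // are copied.
    ToyBuffer share(std::size_t pos, std::size_t len) const {
        assert(pos + len <= size_);
        return ToyBuffer(storage_, data_ + pos, len);
    }

    char* data() { return data_; }
    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

    // Number of views currently keeping the storage alive.
    long owners() const { return storage_.use_count(); }

private:
    ToyBuffer(std::shared_ptr<char> storage, char* data, std::size_t size)
        : storage_(std::move(storage)), data_(data), size_(size) {}

    std::shared_ptr<char> storage_;  // keeps the underlying bytes alive
    char* data_;                     // start of this view
    std::size_t size_;               // length of this view
};
```

For example, a protocol parser could keep `packet.share(header_len, payload_len)` and drop the original `packet`. The payload bytes stay valid and are freed only when that last view is destroyed.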
30
Two TCP/IP implementations
[Diagram: the Networking API sits on top of two alternative stacks]
● POSIX (hosted) stack: Linux kernel (sockets)
● Seastar (native) stack: user-space TCP/IP, then an interface layer over DPDK (igb, ixgb), Virtio or Xen
31
Disk I/O
● Asynchronous and zero copy, using AIO and O_DIRECT.
● Not implemented well by all filesystems
– XFS recommended
● Focusing on SSD
● Future thought:
– Direct NVMe support
– Implement filesystem in Seastar.
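One practical consequence of O_DIRECT is that the buffer address, the file offset and the transfer length generally all have to be aligned, typically to the device's logical block size. The sketch below, in standard C++, shows the arithmetic for widening an arbitrary read request to block boundaries. The helper names and the 4096-byte block size used in the example are assumptions for illustration, not values from the talk.

```cpp
#include <cassert>
#include <cstdint>

// Round down/up to a multiple of `align`, which must be a power of two.
constexpr std::uint64_t align_down(std::uint64_t v, std::uint64_t align) {
    return v & ~(align - 1);
}
constexpr std::uint64_t align_up(std::uint64_t v, std::uint64_t align) {
    return (v + align - 1) & ~(align - 1);
}

struct AlignedRange {
    std::uint64_t offset;  // aligned file offset to read from
    std::uint64_t length;  // aligned number of bytes to read
    std::uint64_t skip;    // bytes to skip at the front of the buffer
};

// Widen an arbitrary (offset, length) request to block boundaries, as a
// direct-I/O read would need.
AlignedRange widen_for_direct_io(std::uint64_t offset, std::uint64_t length,
                                 std::uint64_t block) {
    assert(block != 0 && (block & (block - 1)) == 0);  // power of two
    const std::uint64_t start = align_down(offset, block);
    const std::uint64_t end = align_up(offset + length, block);
    return AlignedRange{start, end - start, offset - start};
}
```

With an assumed 4096-byte block, a request for 100 bytes at offset 5000 becomes a read of 4096 bytes at offset 4096, and the caller skips the first 904 bytes of the buffer.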
32
More info on Seastar
● http://seastar-project.com
● https://github.com/scylladb/seastar
● http://docs.seastar-project.org/
● http://docs.seastar-project.org/master/md_doc_tutorial.html
33
ScyllaDB
● NoSQL database, implemented in Seastar.
● Fully compatible with Cassandra:
– Same CQL queries
– Copy over a complete Cassandra database
– Use existing drivers
– Use existing cassandra.yaml
– Use same nodetool or JMX console
– Can be clustered (of course...)
34
[Diagram: caching layers, ScyllaDB vs. Cassandra]
Cassandra: key cache and row cache (on-heap / off-heap), Linux page cache, SSTables
ScyllaDB: unified cache, SSTables
● Don't double-cache.
● Don't cache unrelated rows.
● Don't cache unparsed sstables.
● Can fit much more into cache.
● No page faults, threads, etc.
35
Scylla vs. Cassandra
● Single node benchmark:
– 2 x 12-core x 2 hyperthread Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz

cassandra-stress benchmark (operations per second):

         ScyllaDB     Cassandra
Write    1,871,556      251,785
Read     1,585,416       95,874
Mixed    1,372,451      108,947
36
Scylla vs. Cassandra
● We really got a 7x to 16x speedup!
● Reads sped up more:
– Cassandra writes are simpler
– Row-cache benefits further improve Scylla's reads
● Almost 2 million writes per second on a single machine!
– Google reported in their blog achieving 1 million writes per second on 330 (!) machines
– (2 years ago, and RF=3… but still impressive).
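The speedup range quoted above follows directly from the single-node benchmark table: divide each ScyllaDB figure by the matching Cassandra figure. A short C++ sketch of that arithmetic is below. The struct and function names are hypothetical, and the numbers are the ones from the table.

```cpp
#include <cassert>
#include <cmath>

struct BenchRow {
    const char* workload;
    double scylla_ops;     // ScyllaDB operations per second
    double cassandra_ops;  // Cassandra operations per second
};

// Figures from the single-node cassandra-stress table.
const BenchRow kRows[] = {
    {"Write", 1871556.0, 251785.0},
    {"Read", 1585416.0, 95874.0},
    {"Mixed", 1372451.0, 108947.0},
};

// Speedup of ScyllaDB over Cassandra for one workload.
double speedup(const BenchRow& row) {
    assert(row.cassandra_ops > 0.0);
    return row.scylla_ops / row.cassandra_ops;
}

// Round to one decimal place for reporting.
double round1(double x) { return std::round(x * 10.0) / 10.0; }
```

This gives roughly 7.4x for writes, 16.5x for reads and 12.6x for the mixed workload, which is the 7x to 16x range on the slide.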
37
Scylla vs. Cassandra
3 node cluster, 2x12 cores each; RF=3, CL=quorum
38
Better latency, at all load levels
39
What will you do with 10x performance?
● Shrink your cluster by a factor of 10
● Use stronger (but slower) data models
● Run more queries - more value from your data
● Stop using caches in front of databases
40
41
Do we qualify?
In 3 years, our small team wrote:
● A complete kernel and library (OSv).
● An asynchronous programming framework (Seastar).
● A complete Cassandra-compatible NoSQL database (ScyllaDB).
42
43
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645402.