Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra


Transcript of Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra

Page 1: Seastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra

Nadav Har'El, ScyllaDB

The Generalist Engineer meetup, Tel-Aviv, Ides of March, 2016


Page 2

● Israeli but multi-national startup company
  – 15 developers cherry-picked from 10 countries.

● Founded 2013 (“Cloudius Systems”)
  – by Avi Kivity and Dor Laor of KVM fame.

● Fans of open-source: OSv, Seastar, ScyllaDB.

Page 3

Make Cassandra 10 times faster

Your mission, should you choose to accept it:

Page 4

“Make Cassandra 10 times faster”

● Why 10?
● Why Cassandra?

– Popular NoSQL database (second in popularity to MongoDB).

– Powerful and widely applicable.

– Example of a wider class of middleware.

● Why “mission impossible”?
  – Cassandra is not considered particularly slow:

– Considered faster than MongoDB, HBase, et al.

– “disk is the bottleneck” (no longer true, with SSDs!)

Page 5

Our first attempt: OSv

● New OS design, specifically for cloud VMs:
  – Run a single application per VM (“unikernel”).
  – Run existing Linux applications (Cassandra).
  – Run them faster than Linux.

Page 6

OSv

● Some of the many ideas we used in OSv:
  – Single address space.
  – A system call is just a function call.
  – Faster context switches.
  – No spin locks.
  – Smaller code.
  – Redesigned network stack (Van Jacobson).

Page 7

OSv

● Writing an entire OS from scratch was a really fun exercise for our generalist engineers.

● A full description of OSv is beyond the scope of this talk. Check out:
  – “OSv—Optimizing the Operating System for Virtual Machines”, Usenix ATC 2014.

Page 8

Cassandra on OSv

● Cassandra-stress, READ, 4 vcpu:

On OSv, 34% faster than Linux

● Very nice, but not even close to our goal.

What are the remaining bottlenecks?

Page 9

Bottlenecks: API locks

● In one profile, we saw 20% of the run spent in lock() and unlock() operations, most of them uncontended.
  – POSIX APIs allow threads to share
    ● file descriptors
    ● sockets
  – As many as 20 lock/unlock operations for each network packet!

● Uncontended locks were efficient on UP (uniprocessor; a flag to disable preemption), but atomic operations are slow on many cores.

Page 10

Bottlenecks: API copies

● The write/send system calls copy user data to the kernel
  – Even on OSv, with no user-kernel separation.
  – It is part of the socket API.

● Similar for read.

Page 11

Bottlenecks: context switching

● One thread per CPU is optimal; more than one requires:
  – Context switch time.
  – Stacks consume memory and pollute the CPU cache.
  – Thread imbalance.

● Requires fully non-blocking APIs
  – Cassandra uses mmap() for disk…

Page 12

Bottlenecks: unscalable applications

● Contended locks ruin scalability to many cores
  – E.g., Memcached's counters and shared cache.
  ● Solution: per-CPU data.

● Even lock-free atomic algorithms are unscalable
  – Cache line bouncing.
  ● Again, better to shard, not share, data.

● Becomes worse as the core count grows
  – NUMA

Page 13

Therefore

● Need to provide better APIs for server applications
  – Not file descriptors, sockets, threads, etc.

● Need to write better applications.

Page 14

Framework

● One thread per CPU
  – Event-driven programming.
  – Everything (network & disk) is non-blocking.
  – But how do we write complex applications?

Page 15

Framework

● Sharded (shared-nothing) applications
  – Important!

Page 16

Framework

● A language with no runtime overhead or built-in data sharing

Page 17

Seastar

● C++14 library
● For writing new high-performance server applications
● Share-nothing model, fully asynchronous
● Based on futures & continuations
  – A unified API for all asynchronous operations.
  – Compose complex asynchronous operations.
  – The key to complex applications.
● (Optionally) fully zero-copy user-space TCP/IP (over DPDK)
● Open source: http://www.seastar-project.org/

Page 18

Seastar linear scaling in #cores

Page 19

Seastar linear scaling in #cores

Page 20

Brief introduction to Seastar

Page 21

Sharded application design

● One thread per CPU
● Each thread handles one shard of the data
  – No shared data (“share nothing”).
  – Separate memory per CPU (NUMA-aware).
  – Message passing between CPUs.
  – No locks or cache-line bounces.

● A reactor (event loop) per thread
● The user-space network stack is also sharded

Page 22

Futures and continuations

● Futures and continuations are the building blocks of asynchronous programming in Seastar.

● They can be composed together into a large, complex, asynchronous program.

Page 23

Futures and continuations

● A future is a result which may not be available yet:
  – A data buffer from the network.
  – A timer expiration.
  – The completion of a disk write.
  – The result of a computation which requires the values of one or more other futures.

● future<int>
● future<>

Page 24

Futures and continuations

● An asynchronous function (also “promise”) is a function returning a future:
  – future<> sleep(duration)
  – future<temporary_buffer<char>> read()

● The function sets things up for the future to be fulfilled
  – sleep() sets a timer to fulfill the future it returns.

Page 25

Futures and continuations

● A continuation is a callback, typically a lambda, executed when a future becomes ready:
  – sleep(1s).then([] { std::cerr << "done"; });

● A continuation can hold state (lambda capture):
  – future<int> slow_incr(int i) {
        return sleep(10ms).then([i] { return i + 1; });
    }

Page 26

Futures and continuations

● Continuations can be nested:
  – future<int> get();
    future<> put(int);
    get().then([] (int value) {
        put(value + 1).then([] {
            std::cout << "done";
        });
    });

● Or chained:
  – get().then([] (int value) {
        return put(value + 1);
    }).then([] {
        std::cout << "done";
    });

Page 27

Futures and continuations

● Parallelism is easy:
  – sleep(100ms).then([] { std::cout << "100ms\n"; });
    sleep(200ms).then([] { std::cout << "200ms\n"; });

Page 28

Futures and continuations

● In Seastar, every asynchronous operation is a future:
  – Network read or write
  – Disk read or write
  – Timers
  – …
  – A complex combination of other futures

● Useful for everything from writing a network stack to writing a full, complex application.

Page 29

Network zero-copy

● future<temporary_buffer> input_stream::read()
  – temporary_buffer points at driver-provided pages, if possible.
  – Automatically discarded after use (C++).

● future<> output_stream::write(temporary_buffer)
  – The future becomes ready when the TCP window allows further writes (usually immediately).
  – The buffer is discarded after the data is ACKed.

Page 30

Two TCP/IP implementations

[Diagram: a common networking API sits on top of two interchangeable stacks. The POSIX (hosted) stack is backed by the Linux kernel's sockets; the Seastar (native) stack is a user-space TCP/IP implementation over an interface layer supporting DPDK, Virtio, Xen, igb and ixgb.]

Page 31

Disk I/O

● Asynchronous and zero-copy, using AIO and O_DIRECT.
● Not implemented well by all filesystems
  – XFS recommended.

● Focusing on SSDs
● Future thoughts:
  – Direct NVMe support.
  – Implement a filesystem in Seastar.

Page 32

More info on Seastar

● http://seastar-project.com
● https://github.com/scylladb/seastar
● http://docs.seastar-project.org/
● http://docs.seastar-project.org/master/md_doc_tutorial.html

Page 33

ScyllaDB

● NoSQL database, implemented in Seastar.
● Fully compatible with Cassandra:
  – Same CQL queries.
  – Copy over a complete Cassandra database.
  – Use existing drivers.
  – Use your existing cassandra.yaml.
  – Use the same nodetool or JMX console.
  – Can be clustered (of course...).

Page 34

[Diagram: Cassandra stacks a key cache, a row cache, on-heap/off-heap memory and the Linux page cache between it and its SSTables; ScyllaDB replaces all of these with a single unified cache over the SSTables.]

● Don't double-cache.
● Don't cache unrelated rows.
● Don't cache unparsed sstables.
● Can fit much more into the cache.
● No page faults, threads, etc.

Page 35

Scylla vs. Cassandra

● Single node benchmark:
  – 2 x 12-core x 2 hyperthread Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz

cassandra-stress benchmark:

            ScyllaDB    Cassandra
  Write    1,871,556      251,785
  Read     1,585,416       95,874
  Mixed    1,372,451      108,947

Page 36

Scylla vs. Cassandra

● We really got a 7x – 16x speedup!
● Reads sped up more:
  – Cassandra writes are simpler.
  – Row-cache benefits further improve Scylla's reads.

● Almost 2 million writes per second on a single machine!
  – Google reported in their blogs achieving 1 million writes per second on 330 (!) machines.
  – (2 years ago, and RF=3… but still impressive).

Page 37

Scylla vs. Cassandra
3-node cluster, 2x12 cores each; RF=3, CL=quorum

Page 38

Better latency, at all load levels

Page 39

What will you do with 10x performance?

● Shrink your cluster by a factor of 10
● Use stronger (but slower) data models
● Run more queries – get more value from your data
● Stop using caches in front of databases

Page 40

Page 41

Do we qualify?

In 3 years, our small team wrote:
● A complete kernel and library (OSv).
● An asynchronous programming framework (Seastar).
● A complete Cassandra-compatible NoSQL database (ScyllaDB).

Page 42

Page 43

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 645402.