Apache Cassandra in Bangalore - Cassandra Internals and Performance
BANGALORE CASSANDRA UG APRIL 2013
CASSANDRA INTERNALS & PERFORMANCE
Aaron Morton @aaronmorton
www.thelastpickle.com
Licensed under a Creative Commons Attribution-NonCommercial 3.0 New Zealand License
Architecture
Code
Cassandra Architecture
Diagram: Clients, APIs, Cluster Aware, Cluster Unaware, Disk.
Cassandra Cluster Architecture
Diagram: Clients; Node 1 and Node 2, each with APIs, Cluster Aware, Cluster Unaware and Disk.
Dynamo Cluster Architecture
Diagram: Clients; Node 1 and Node 2, each with APIs, Dynamo, Database and Disk.
Architecture
API
Dynamo
Database
API Transports
Thrift
Native Binary
Read Line
RMI
Thrift Transport
// Custom TServer implementations
o.a.c.thrift.CustomTThreadPoolServer
o.a.c.thrift.CustomTNonBlockingServer
o.a.c.thrift.CustomTHsHaServer
Native Binary Transport
Beta in Cassandra 1.2
Uses Netty 3.5
Enabled with start_native_transport
(Disabled by default)
o.a.c.transport.Server.run()
// Setup the Netty server
new ExecutionHandler()
new NioServerSocketChannelFactory()
ServerBootstrap.setPipelineFactory()
o.a.c.transport.Message.Dispatcher.messageReceived()
// Process message from client
ServerConnection.validateNewMessage()
Request.execute()
ServerConnection.applyStateTransition()
Channel.write()
o.a.c.transport.messages
CredentialsMessage()
EventMessage()
ExecuteMessage()
PrepareMessage()
QueryMessage()
ResultMessage()
(And more...)
Messages
Defined in the Native Binary Protocol
$SRC/doc/native_protocol.spec
API Services
JMX
CLI
Thrift
CQL 3
JMX Management Beans
Spread around the code base.
Interfaces named *MBean.
Registered with names such as
org.apache.cassandra.db:type=StorageProxy
o.a.c.cli.CliMain.main()
// Connect to server to read input
this.connect()
this.evaluateFileStatements()
this.processStatementInteractive()
CLI Grammar
ANTLR Grammar
$SRC/src/java/o/a/c/cli/CLI.g
o.a.c.cli.CliClient.executeCLIStatement()
// Process statement
CliCompiler.compileQuery() #ANTLR
switch (tree.getType()) case...
o.a.c.thrift.CassandraServer
// Implements Thrift Interface
// Access control
// Input validation
// Mapping to/from Thrift and internal types
Thrift Interface
Thrift IDL
$SRC/interface/cassandra.thrift
o.a.c.thrift.CassandraServer.get_slice()
// get columns for one row
Tracing.begin()
ClientState cState = state()
cState.hasColumnFamilyAccess()
multigetSliceInternal()
CassandraServer.multigetSliceInternal()
// get columns for many rows
ThriftValidation.validate*()
// Create ReadCommands
getSlice()
CassandraServer.getSlice()
// Process ReadCommands
// Return Thrift types
readColumnFamily()
thriftifyColumnFamily()
CassandraServer.readColumnFamily()
// Process ReadCommands
// Return ColumnFamilies
StorageProxy.read()
o.a.c.cql3.QueryProcessor
// Prepares and executes CQL3 statements
// Used by Thrift & Native transports
// Access control
// Input validation
// Returns transport.ResultMessage
CQL3 Grammar
ANTLR Grammar
$SRC/o.a.c.cql3/Cql.g
o.a.c.cql3.statements.ParsedStatement
// Subclasses generated by ANTLR
// Tracks bound term count
// Prepare CQLStatement
prepare()
o.a.c.cql3.statements.CQLStatement
checkAccess(ClientState state)
validate(ClientState state)
execute(ConsistencyLevel cl, QueryState state, List<ByteBuffer> variables)
o.a.c.cql3.functions.Function
argsType()
returnType()
execute(List<ByteBuffer> parameters)
statements.SelectStatement.RawStatement
// Implements ParsedStatement
// Input validation
prepare()
statements.SelectStatement.execute()
// Create ReadCommands
StorageProxy.read()
Architecture
API
Dynamo
Database
Dynamo Layer
o.a.c.service
o.a.c.net
o.a.c.dht
o.a.c.locator
o.a.c.gms
o.a.c.stream
o.a.c.service.StorageProxy
// Cluster wide storage operations
// Select endpoints & check CL available
// Send messages to Stages
// Wait for response
// Store Hints
o.a.c.service.StorageService
// Ring operations
// Track ring state
// Start & stop ring membership
// Node & token queries
o.a.c.service.IResponseResolver
preprocess(MessageIn<T> message)
resolve() throws DigestMismatchException
RowDigestResolver
RowDataResolver
RangeSliceResponseResolver
Response Handlers / Callback
implements IAsyncCallback<T>
response(MessageIn<T> msg)
o.a.c.service.ReadCallback.get()
// Wait for blockfor & data
condition.await(timeout, TimeUnit.MILLISECONDS)
throw ReadTimeoutException()
resolver.resolve()
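The wait in ReadCallback.get() can be pictured as a latch that opens once enough replicas have answered. What follows is a minimal sketch under stated assumptions, not Cassandra's code: the class name ReadCallbackSketch, the use of a CountDownLatch in place of the condition shown on the slide, and java.util.concurrent.TimeoutException standing in for ReadTimeoutException are all illustrative choices.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for ReadCallback: collects replica responses and
// blocks the caller until blockFor of them have arrived or the timeout expires.
class ReadCallbackSketch<T> {
    private final CountDownLatch remaining;
    private final List<T> responses = new CopyOnWriteArrayList<>();

    ReadCallbackSketch(int blockFor) {
        this.remaining = new CountDownLatch(blockFor);
    }

    // Called once per replica response (the role of IAsyncCallback.response()).
    void response(T message) {
        responses.add(message);
        remaining.countDown();
    }

    // Blocks like ReadCallback.get(); TimeoutException is an assumed
    // substitute for ReadTimeoutException.
    List<T> get(long timeoutMillis) throws InterruptedException, TimeoutException {
        if (!remaining.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("only " + responses.size() + " responses received");
        }
        return responses;
    }

    public static void main(String[] args) throws Exception {
        ReadCallbackSketch<String> callback = new ReadCallbackSketch<>(2);
        callback.response("replica-1");
        callback.response("replica-2");
        System.out.println(callback.get(100));
    }
}
```

In the real read path the resolver would then be asked to reconcile the collected responses; the sketch stops at handing them back.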
o.a.c.service.StorageProxy.fetchRows()
getLiveSortedEndpoints()
new RowDigestResolver()
new ReadCallback()
MessagingService.sendRR()
---------------------------------------
ReadCallback.get() # blocking
catch (DigestMismatchException ex)
catch (ReadTimeoutException ex)
o.a.c.net.MessagingService.Verb <<enum>>
MUTATION
READ
REQUEST_RESPONSE
TREE_REQUEST
TREE_RESPONSE
(And more...)
o.a.c.net.MessagingService.verbHandlers
new EnumMap<Verb, IVerbHandler>(Verb.class)
o.a.c.net.IVerbHandler<T>
doVerb(MessageIn<T> message, String id);
o.a.c.net.MessagingService.verbStages
new EnumMap<MessagingService.Verb, Stage>(MessagingService.Verb.class)
o.a.c.net.MessagingService.receive()
runnable = new MessageDeliveryTask(message, id, timestamp);
StageManager.getStage(message.getMessageType());
stage.execute(runnable);
o.a.c.net.MessageDeliveryTask.run()
// If droppable and rpc_timeout
MessagingService.incrementDroppedMessages(verb);
MessagingService.getVerbHandler(verb)
verbHandler.doVerb(message, id)
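The verb lookup and drop check above can be sketched with an EnumMap, as the slides show for verbHandlers. This is a cut-down model rather than the MessagingService API: the class VerbDispatchSketch, the VerbHandler interface taking plain strings, the explicit age and timeout arguments, and the choice of which verbs count as droppable are all assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of verb dispatch; names and signatures are illustrative.
class VerbDispatchSketch {
    enum Verb { MUTATION, READ, REQUEST_RESPONSE }

    interface VerbHandler {
        void doVerb(String message, String id);
    }

    private final Map<Verb, VerbHandler> verbHandlers = new EnumMap<>(Verb.class);
    // Assumed set of droppable verbs, chosen only for the example.
    private final Set<Verb> droppable = EnumSet.of(Verb.MUTATION, Verb.READ);
    private int dropped = 0;

    void registerVerbHandler(Verb verb, VerbHandler handler) {
        verbHandlers.put(verb, handler);
    }

    // Follows the shape of MessageDeliveryTask.run(): count and skip a
    // droppable message that is older than the timeout, otherwise look up
    // the handler for the verb and invoke it.
    void deliver(Verb verb, String message, String id, long ageMillis, long timeoutMillis) {
        if (droppable.contains(verb) && ageMillis > timeoutMillis) {
            dropped++;
            return;
        }
        VerbHandler handler = verbHandlers.get(verb);
        if (handler == null) {
            throw new IllegalStateException("no handler registered for " + verb);
        }
        handler.doVerb(message, id);
    }

    int droppedCount() {
        return dropped;
    }

    public static void main(String[] args) {
        VerbDispatchSketch dispatch = new VerbDispatchSketch();
        dispatch.registerVerbHandler(Verb.READ,
                (message, id) -> System.out.println(id + " -> " + message));
        dispatch.deliver(Verb.READ, "read row", "1", 5, 100);   // handled
        dispatch.deliver(Verb.READ, "read row", "2", 500, 100); // too old, dropped
        System.out.println("dropped: " + dispatch.droppedCount());
    }
}
```

The sketch runs the handler on the calling thread; in the slides the task is instead handed to the Stage mapped to the verb.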
o.a.c.dht.IPartitioner<T extends Token>
getToken(ByteBuffer key)
getRandomToken()
LocalPartitioner
RandomPartitioner
Murmur3Partitioner
o.a.c.dht.Token<T>
compareTo(Token<T> o)
BytesToken
BigIntegerToken
LongToken
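To make getToken(ByteBuffer key) concrete, here is a minimal sketch in the spirit of RandomPartitioner, on the assumption that it hashes the key with MD5 and treats the digest as a non-negative BigInteger. The class PartitionerSketch is hypothetical and a plain BigInteger stands in for BigIntegerToken.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of an MD5-based partitioner; not the o.a.c.dht code.
class PartitionerSketch {
    // Hash the key bytes with MD5 and read the 16-byte digest as a
    // non-negative BigInteger, which then orders keys around the ring.
    static BigInteger getToken(ByteBuffer key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(key.duplicate()); // duplicate() leaves the caller's position untouched
            return new BigInteger(md5.digest()).abs();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer key = ByteBuffer.wrap("row-1".getBytes(StandardCharsets.UTF_8));
        System.out.println(getToken(key));
    }
}
```

Because tokens are comparable, compareTo() is all the ring needs to place a key relative to the nodes' tokens.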
o.a.c.locator.IEndpointSnitch
getRack(InetAddress endpoint)
getDatacenter(InetAddress endpoint)
sortByProximity(InetAddress address, List<InetAddress> addresses)
SimpleSnitch
PropertyFileSnitch
Ec2MultiRegionSnitch
o.a.c.locator.AbstractReplicationStrategy
getNaturalEndpoints(RingPosition searchPosition)
calculateNaturalEndpoints(Token searchToken, TokenMetadata tokenMetadata)
SimpleStrategy
NetworkTopologyStrategy
o.a.c.locator.TokenMetadata
BiMultiValMap<Token, InetAddress> tokenToEndpointMap
BiMultiValMap<Token, InetAddress> bootstrapTokens
Set<InetAddress> leavingEndpoints
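The idea behind calculateNaturalEndpoints() for a SimpleStrategy-style placement can be sketched as a clockwise walk around a sorted token map. This is an illustration only: RingSketch, the long tokens and the String endpoints are assumptions, and rack or datacenter awareness (the snitch and NetworkTopologyStrategy) is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical ring model: a sorted token -> endpoint map walked clockwise.
class RingSketch {
    private final TreeMap<Long, String> tokenToEndpoint = new TreeMap<>();

    void addNode(long token, String endpoint) {
        tokenToEndpoint.put(token, endpoint);
    }

    // Start at the first node token >= searchToken (wrapping to the lowest
    // token), then keep walking until replicationFactor distinct endpoints
    // are collected or the ring is exhausted.
    List<String> naturalEndpoints(long searchToken, int replicationFactor) {
        List<String> endpoints = new ArrayList<>();
        if (tokenToEndpoint.isEmpty()) {
            return endpoints;
        }
        Long start = tokenToEndpoint.ceilingKey(searchToken);
        if (start == null) {
            start = tokenToEndpoint.firstKey();
        }
        List<String> clockwise = new ArrayList<>(tokenToEndpoint.tailMap(start, true).values());
        clockwise.addAll(tokenToEndpoint.headMap(start, false).values());
        for (String endpoint : clockwise) {
            if (endpoints.size() == replicationFactor) {
                break;
            }
            if (!endpoints.contains(endpoint)) {
                endpoints.add(endpoint);
            }
        }
        return endpoints;
    }

    public static void main(String[] args) {
        RingSketch ring = new RingSketch();
        ring.addNode(0L, "A");
        ring.addNode(100L, "B");
        ring.addNode(200L, "C");
        System.out.println(ring.naturalEndpoints(150L, 2)); // prints [C, A]
    }
}
```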
o.a.c.gms.VersionedValue
// VersionGenerator.getNextVersion()
public final int version;
public final String value;
o.a.c.gms.ApplicationState <<enum>>
STATUS
LOAD
SCHEMA
DC
RACK
(And more...)
o.a.c.gms.HeartBeatState
// VersionGenerator.getNextVersion();
private int generation;
private int version;
o.a.c.gms.Gossiper.GossipTask.run()
// SYN -> ACK -> ACK2
makeRandomGossipDigest()
new GossipDigestSyn()
// Use MessagingService.sendOneWay()
Gossiper.doGossipToLiveMember()
Gossiper.doGossipToUnreachableMember()
Gossiper.doGossipToSeed()
gms.GossipDigestSynVerbHandler.doVerb()
Gossiper.examineGossiper()
new GossipDigestAck()
MessagingService.sendOneWay()
gms.GossipDigestAckVerbHandler.doVerb()
Gossiper.notifyFailureDetector()
Gossiper.applyStateLocally()
Gossiper.makeGossipDigestAck2Message()
gms.GossipDigestAck2VerbHandler.doVerb()
Gossiper.notifyFailureDetector()
Gossiper.applyStateLocally()
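When digests are exchanged in the SYN, ACK, ACK2 rounds, each side has to decide whose copy of an endpoint's state is newer. A minimal sketch of that decision, assuming it rests on the generation and version fields of HeartBeatState shown earlier: a higher generation wins, and within the same generation a higher version wins. GossipStateSketch and its members are hypothetical names, not the o.a.c.gms classes.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of heartbeat comparison; names are illustrative.
class GossipStateSketch {
    // Plays the role of VersionGenerator.getNextVersion(): a process-wide,
    // monotonically increasing counter.
    private static final AtomicInteger VERSION = new AtomicInteger(0);

    static int nextVersion() {
        return VERSION.incrementAndGet();
    }

    static final class HeartBeat {
        final int generation; // assumed to change when the node restarts
        final int version;    // assumed to increase while the node is running

        HeartBeat(int generation, int version) {
            this.generation = generation;
            this.version = version;
        }
    }

    // True when the remote state should replace the local one.
    static boolean remoteIsNewer(HeartBeat local, HeartBeat remote) {
        if (remote.generation != local.generation) {
            return remote.generation > local.generation;
        }
        return remote.version > local.version;
    }

    public static void main(String[] args) {
        HeartBeat local = new HeartBeat(1, 50);
        HeartBeat restarted = new HeartBeat(2, 1);
        System.out.println(remoteIsNewer(local, restarted)); // prints true
    }
}
```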
Architecture
API Layer
Dynamo Layer
Database Layer
Database Layer
o.a.c.concurrent
o.a.c.db
o.a.c.cache
o.a.c.io
o.a.c.trace
o.a.c.concurrent.StageManager
stages = new EnumMap<Stage, ThreadPoolExecutor>(Stage.class);
getStage(Stage stage)
o.a.c.concurrent.Stage
READ
MUTATION
GOSSIP
REQUEST_RESPONSE
ANTI_ENTROPY
(And more...)
o.a.c.db.Table
// Keyspace
open(String table)
getColumnFamilyStore(String cfName)
getRow(QueryFilter filter)
apply(RowMutation mutation, boolean writeCommitLog)
o.a.c.db.ColumnFamilyStore
// Column Family
getColumnFamily(QueryFilter filter)
getTopLevelColumns(...)
apply(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)
o.a.c.db.IColumnContainer
addColumn(IColumn column)
remove(ByteBuffer columnName)
ColumnFamily
SuperColumn
o.a.c.db.ISortedColumns
addColumn(IColumn column, Allocator allocator)
removeColumn(ByteBuffer name)
ArrayBackedSortedColumns
AtomicSortedColumns
TreeMapBackedSortedColumns
o.a.c.db.Memtable
put(DecoratedKey key, ColumnFamily columnFamily, SecondaryIndexManager.Updater indexer)
flushAndSignal(CountDownLatch latch, Future<ReplayPosition> context)
Memtable.FlushRunnable.writeSortedContents()
// SSTableWriter
createFlushWriter()
// Iterate through rows & CFs in order
writer.append()
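The reason a flush can simply iterate and append is that the memtable keeps its contents sorted as they are written. A minimal sketch of that property, assuming a ConcurrentSkipListMap as the backing structure; MemtableSketch, the String keys and columns, and the list returned in place of an SSTableWriter are all illustrative. In Cassandra rows are ordered by DecoratedKey rather than by raw key string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical memtable model: sorted rows, each holding sorted columns.
class MemtableSketch {
    private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, String>> rows =
            new ConcurrentSkipListMap<>();

    // Insert or overwrite one column of one row.
    void put(String key, String column, String value) {
        rows.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>()).put(column, value);
    }

    // Stand-in for writeSortedContents(): rows and columns come out in
    // sorted order, so a writer can append them sequentially.
    List<String> writeSortedContents() {
        List<String> appended = new ArrayList<>();
        for (Map.Entry<String, ConcurrentSkipListMap<String, String>> row : rows.entrySet()) {
            for (Map.Entry<String, String> column : row.getValue().entrySet()) {
                appended.add(row.getKey() + "/" + column.getKey() + "=" + column.getValue());
            }
        }
        return appended;
    }

    public static void main(String[] args) {
        MemtableSketch memtable = new MemtableSketch();
        memtable.put("b", "c1", "1");
        memtable.put("a", "c2", "2");
        memtable.put("a", "c1", "3");
        System.out.println(memtable.writeSortedContents()); // prints [a/c1=3, a/c2=2, b/c1=1]
    }
}
```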
o.a.c.db.ReadCommand
getRow(Table table)
SliceByNamesReadCommand
SliceFromReadCommand
o.a.c.db.IDiskAtomFilter
getMemtableColumnIterator(...)
getSSTableColumnIterator(...)
IdentityQueryFilter
NamesQueryFilter
SliceQueryFilter
Some query performance...
Today.
Write Path
Read Path
memtable_flush_queue_size test...
m1.xlarge Cassandra node
m1.xlarge client node
1 CF with 6 Secondary Indexes
1 Client Thread
10,000 Inserts, 100 Columns per Row
1100 bytes per Column
CF write latency and memtable_flush_queue_size...
Chart: latency in microseconds (axis 0 to 1,200) at the 85th, 95th, 99th and 100th percentiles; series: memtable_flush_queue_size=7, memtable_flush_queue_size=1.
Request latency and memtable_flush_queue_size...
Chart: latency in microseconds (axis 0 to 5,000,000) at the 85th, 95th, 99th and 100th percentiles; series: memtable_flush_queue_size=7, memtable_flush_queue_size=1.
durable_writes test...
10,000 Inserts, 50 Columns per Row
50 bytes per Column
Request latency and durable_writes (1 client)...
Chart: latency in microseconds (axis 0 to 7,000) at the 85th, 95th and 99th percentiles; series: enabled, disabled.
Request latency and durable_writes (10 clients)...
Chart: latency in microseconds (axis 0 to 30,000) at the 85th, 95th and 99th percentiles; series: enabled, disabled.
Request latency and durable_writes (20 clients)...
Chart: latency in microseconds (axis 0 to 90,000) at the 85th, 95th and 99th percentiles; series: enabled, disabled.
CommitLog tests...
10,000 Inserts, 50 Columns per Row
50 bytes per Column
The periodic commit log adds the mutation to a queue and then acknowledges.
The Commit Log is appended to by a single thread; sync is called every commitlog_sync_period_in_ms.
Request latency and commitlog_sync_period_in_ms...
Chart: latency in microseconds (axis 170 to 220) at the 85th, 95th and 99th percentiles; series: 10,000 ms, 10 ms.
The batch commit log adds the mutation to a queue and waits before acknowledging.
The writer thread processes mutations for commitlog_sync_batch_window_in_ms, then syncs, then signals.
Request latency comparing periodic and batch sync...
Chart: latency in microseconds (axis 0 to 800) at the 85th, 95th and 99th percentiles; series: periodic, batch.
Merge mutation...
Row level Isolation provided via SnapTree.
(https://github.com/nbronson/snaptree)
Row concurrency tests...
10,000 Columns per Row
50 bytes per Column
50 Columns per Insert
CF Write Latency and row concurrency (10 clients)...
Chart: latency in microseconds (axis 0 to 2,000) at the 85th, 95th and 99th percentiles; series: different rows, single row.
Secondary Indexes...
synchronized access to indexed rows.
(Keyspace wide)
Index concurrency tests...
CF with 2 Indexes
10,000 Inserts
6 Columns per Row
35 bytes per Column
Alternating column values
Request latency and index concurrency (10 clients)...
Chart: latency in microseconds (axis 0 to 4,000) at the 85th, 95th and 99th percentiles; series: different rows, single row.
Index tests...
10,000 Inserts
50 Columns per Row
50 bytes per Column
Request latency and secondary indexes...
Chart: latency in microseconds (axis 0 to 3,000) at the 85th, 95th and 99th percentiles; series: no indexes, six indexes.
Today
Write Path
Read Path
bloom_filter_fp_chance tests...
1,000,000 Rows
50 Columns per Row
50 bytes per Column
commitlog_total_space_in_mb: 1
Read random 10% of rows.
CF read latency and bloom_filter_fp_chance...
Chart: latency in microseconds (axis 0 to 7,000) at the 85th, 95th and 99th percentiles; series: default (0.000744), 0.1.
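The memory cost behind the two bloom_filter_fp_chance settings can be estimated with the textbook sizing formula for an optimally configured Bloom filter: bits per key = -ln(p) / (ln 2)^2 and hash count = bits per key * ln 2. The sketch below applies that formula only; it is an assumption that Cassandra's own sizing follows it closely, and BloomFilterSizing is a hypothetical helper.

```java
// Textbook Bloom filter sizing; not Cassandra's implementation.
class BloomFilterSizing {
    // Bits needed per stored key for a target false-positive chance.
    static double bitsPerKey(double fpChance) {
        return -Math.log(fpChance) / (Math.log(2) * Math.log(2));
    }

    // Number of hash functions that minimises false positives at that size.
    static long hashCount(double fpChance) {
        return Math.max(1, Math.round(bitsPerKey(fpChance) * Math.log(2)));
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.000744, 0.1}) {
            System.out.printf("fp_chance=%s bits/key=%.2f hashes=%d%n",
                    p, bitsPerKey(p), hashCount(p));
        }
    }
}
```

By this estimate 0.000744 needs roughly 15 bits per key and 0.1 needs under 5, so the higher setting trades about two thirds of the filter memory for more unnecessary SSTable reads.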
key_cache_size_in_mb tests...
10,000 Rows
50 Columns per Row
50 bytes per Column
Read all Rows
CF read latency and key_cache_size_in_mb...
Chart: latency in microseconds (axis 0 to 300) at the 85th, 95th and 99th percentiles; series: default (100MB) with 100% hit rate, disabled.
index_interval tests...
100,000 Rows
50 Columns per Row
50 bytes per Column
key_cache_size_in_mb: 0
Read 1 Column from random 10% of Rows
CF read latency and index_interval...
Chart: latency in microseconds (axis 0 to 20,000) at the 85th, 95th and 99th percentiles; series: index_interval=128 (default), index_interval=512.
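The trade-off measured here can be sketched as sampling: keep every index_interval-th key in memory, binary search the samples, then scan the full index from the chosen sample. The model below is a simplification under stated assumptions; IndexSampleSketch is hypothetical and an in-memory sorted list stands in for the on-disk primary index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical model of index sampling; not the o.a.c.io code.
class IndexSampleSketch {
    private final List<String> sortedKeys;            // stands in for the on-disk index
    private final List<String> samples = new ArrayList<>();
    private final int indexInterval;

    IndexSampleSketch(List<String> sortedKeys, int indexInterval) {
        this.sortedKeys = sortedKeys;
        this.indexInterval = indexInterval;
        // Keep every indexInterval-th key in memory.
        for (int i = 0; i < sortedKeys.size(); i += indexInterval) {
            samples.add(sortedKeys.get(i));
        }
    }

    int sampleCount() {
        return samples.size();
    }

    // Returns how many index entries are scanned to find the key, or -1 if absent.
    int entriesScanned(String key) {
        int pos = Collections.binarySearch(samples, key);
        int sampleIndex = pos >= 0 ? pos : -pos - 2; // last sample <= key
        if (sampleIndex < 0) {
            return -1; // key sorts before the first sample
        }
        int start = sampleIndex * indexInterval;
        int end = Math.min(start + indexInterval, sortedKeys.size());
        for (int i = start; i < end; i++) {
            if (sortedKeys.get(i).equals(key)) {
                return i - start + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(String.format("k%03d", i));
        }
        for (int interval : new int[] {128, 512}) {
            IndexSampleSketch index = new IndexSampleSketch(keys, interval);
            System.out.println(interval + ": samples=" + index.sampleCount()
                    + " scanned=" + index.entriesScanned("k511"));
        }
    }
}
```

A larger interval keeps fewer samples in memory but lengthens the worst-case scan, which is the direction of the latency difference the chart compares.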
row_cache_size_in_mb tests...
100,000 Rows
50 Columns per Row
50 bytes per Column
Read all Rows
CF read latency and row_cache_size_in_mb...
Chart: latency in microseconds (axis 0 to 260) at the 85th, 95th and 99th percentiles; series: row_cache_size_in_mb=0 with key_cache_size_in_mb=100, row_cache_size_in_mb=100 with key_cache_size_in_mb=0.
Column Index tests...
Read first Column by name from 1,200 Columns.
Read first Column by name from 1,000,000 Columns.
CF read latency and Column Index...
Chart: latency in microseconds (axis 0 to 6,000) at the 85th, 95th and 99th percentiles; series: First Column from 1,200, First Column from 1,000,000.
Name Locality tests...
1,000,000 Columns
50 bytes per Column
Read 100 Columns from middle of row.
Read 100 Columns from spread across row.
CF read latency and name locality...
Chart: latency in microseconds (axis 0 to 200,000) at the 85th, 95th and 99th percentiles; series: Adjacent Columns, Spread Columns.
Start position tests...
1,000,000 Columns
50 bytes per Column
Read first 100 Columns without start.
Read first 100 Columns with start.
CF read latency and start position...
Chart: latency in microseconds (axis 0 to 40,000) at the 85th, 95th and 99th percentiles; series: Without start position, With start position.
Start offset tests...
1,000,000 Columns
50 bytes per Column
Read first 100 Columns with start.
Read middle 100 Columns with start.
CF read latency and start offset...
Chart: latency in microseconds (axis 0 to 40,000) at the 85th, 95th and 99th percentiles; series: First, Middle.
Reversed tests...
1,000,000 Columns
50 bytes per Column
Read first 100 Columns without start.
Read last 100 Columns with reversed.
CF read latency and reversed...
Chart: latency in microseconds (axis 0 to 40,000) at the 85th, 95th and 99th percentiles; series: Forward, Reversed.
Thanks.