Page 1: Bulk Loading Data into Cassandra

Bulk-Loading Data into Cassandra

Patricia Gorla
@patriciagorla
Cassandra Consultant
www.thelastpickle.com

Planet Cassandra 2014

Page 2: Bulk Loading Data into Cassandra

About Us

• Work with clients to deliver and improve Apache Cassandra services

• Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer

• Based in New Zealand & USA

Page 5: Bulk Loading Data into Cassandra

Why is bulk loading useful?

• Performance tests

• Migrating historical data

• Changing topologies

Page 6: Bulk Loading Data into Cassandra

• How Data is Stored

• Case Studies

- Generating Dummy Data

- Backfilling Historical Data

- Changing Topologies

• Conclusion

Page 7: Bulk Loading Data into Cassandra

Cassandra Write Path

• Writes are written to both the commit log and the memtable.

• The memtable is sorted.

• The memtable is flushed out to sstables.

• Compaction helps keep read latency low.

Diagram: write[0] goes to the commit log and the memtable; the memtable flushes to sstable[0], sstable[1], sstable[2], …, sstable[n].
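
The write path above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not Cassandra code: the class name, the flushThreshold knob and the "later table wins" merge rule are all hypothetical (Cassandra itself reconciles cells by timestamp).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path on the slide. Not Cassandra code. */
final class ToyWritePath {
    private final List<String> commitLog = new ArrayList<>();   // append-only, arrival order
    private TreeMap<String, String> memtable = new TreeMap<>(); // kept sorted by key
    private final List<TreeMap<String, String>> sstables = new ArrayList<>(); // oldest first
    private final int flushThreshold; // hypothetical knob: flush once this many keys are buffered

    ToyWritePath(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** A write goes to both the commit log and the memtable. */
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    /** The sorted memtable is set aside as an sstable; its commit log entries are dropped. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(memtable);
        memtable = new TreeMap<>();
        commitLog.clear();
    }

    /** Merge every sstable into one. Later tables win here; Cassandra compares timestamps. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> table : sstables) {
            merged.putAll(table);
        }
        sstables.clear();
        if (!merged.isEmpty()) {
            sstables.add(merged);
        }
    }

    /** Check the memtable first, then sstables from newest to oldest. */
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        ToyWritePath path = new ToyWritePath(2);
        path.write("b", "1");
        path.write("a", "2");
        path.write("a", "3");
        path.write("c", "4");
        System.out.println("sstables before compaction: " + path.sstableCount());
        path.compact();
        System.out.println("sstables after compaction: " + path.sstableCount());
    }
}
```

In this model a read may have to look in every sstable, which is why merging them during compaction keeps read latency down.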

Page 12: Bulk Loading Data into Cassandra

Sorted String Tables

mykeyspace-mycf-jb-1-CompressionInfo.db
mykeyspace-mycf-jb-1-Data.db: contains all data needed to regenerate the other components
mykeyspace-mycf-jb-1-Filter.db: Bloom filter over the sstable
mykeyspace-mycf-jb-1-Index.db: index of row keys
mykeyspace-mycf-jb-1-Statistics.db
mykeyspace-mycf-jb-1-Summary.db: index summary from the Index.db file
mykeyspace-mycf-jb-1-TOC.txt: table of contents of all components
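
The file names on these slides appear to follow the pattern keyspace-columnfamily-version-generation-component. A minimal sketch of splitting a name into those parts follows; the class and the field names "version" and "generation" are my own labels for the "jb" and "1" segments, and the sketch assumes neither the keyspace nor the column family name contains a hyphen.

```java
/** Splits an sstable component file name into its parts (hypothetical helper). */
final class SSTableComponentName {
    final String keyspace;
    final String columnFamily;
    final String version;     // e.g. "jb"
    final int generation;     // e.g. 1
    final String component;   // e.g. "Data.db"

    private SSTableComponentName(String keyspace, String columnFamily,
                                 String version, int generation, String component) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
        this.version = version;
        this.generation = generation;
        this.component = component;
    }

    /** Assumes exactly five hyphen-separated segments, as in the listing on the slide. */
    static SSTableComponentName parse(String fileName) {
        String[] parts = fileName.split("-", 5);
        if (parts.length != 5) {
            throw new IllegalArgumentException("unexpected sstable file name: " + fileName);
        }
        return new SSTableComponentName(
                parts[0], parts[1], parts[2], Integer.parseInt(parts[3]), parts[4]);
    }

    public static void main(String[] args) {
        SSTableComponentName name = parse("mykeyspace-mycf-jb-1-Data.db");
        System.out.println(name.keyspace + " / " + name.columnFamily + " / " + name.component);
    }
}
```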

Page 19: Bulk Loading Data into Cassandra

Set up keyspace and column family

create keyspace test
    with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};

create column family test
    with comparator = 'AsciiType'
    and default_validation_class = 'AsciiType'
    and key_validation_class = 'AsciiType';

Page 20: Bulk Loading Data into Cassandra

SStableGen.java

AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
    directory,
    partitioner,
    keyspace,
    columnFamily,
    AsciiType.instance,  // comparator
    null,                // subcomparator for super columns
    size_per_sstable_mb);

Page 23: Bulk Loading Data into Cassandra

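ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
KeyGenerator keyGen = new KeyGenerator();
long dataSize = 0;
writer = new SSTableSimpleUnsortedWriter(…);

while (dataSize < max_data_bytes) {
    // KeyGenerator is the speaker's own helper and is not shown; next() is an assumed method name.
    ByteBuffer key = keyGen.next();
    writer.newRow(key);
    for (int j = 0; j < num_cols; j++) {
        ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
        ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
        // Copy the next 20 random bytes into the value; this advances randomBytes by 20.
        randomBytes.get(colValue.array());
        colValue.position(0);
        // Count the bytes generated so the outer loop terminates.
        dataSize += colName.remaining() + colValue.remaining();
        writer.addColumn(colName, colValue, timestamp);

        // Wrap around when fewer than 20 bytes remain; otherwise skip a further 20 bytes.
        if (randomBytes.remaining() < colValue.limit()) {
            randomBytes.position(0);
        } else {
            randomBytes.position(randomBytes.position() + colValue.limit());
        }
    }
}
// Flush whatever is still buffered to a final sstable.
writer.close();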
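
As far as I understand it, an unsorted writer can take rows in any order because it buffers them in memory and writes them out sorted once the buffer exceeds the configured size (the size_per_sstable_mb argument above). The toy class below sketches that idea only; it is not the Cassandra implementation, and every name in it is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy stand-in for an unsorted sstable writer: buffer rows, emit them sorted per flush. */
final class ToyUnsortedWriter {
    private final long bufferLimitBytes;
    private final TreeMap<String, TreeMap<String, String>> buffer = new TreeMap<>();
    private final List<List<String>> flushed = new ArrayList<>(); // row keys of each "sstable"
    private TreeMap<String, String> currentRow;
    private long bufferedBytes = 0;

    ToyUnsortedWriter(long bufferLimitBytes) {
        this.bufferLimitBytes = bufferLimitBytes;
    }

    /** Start (or resume) a row; flush first if the buffer is already over the limit. */
    void newRow(String key) {
        if (bufferedBytes >= bufferLimitBytes) {
            flush();
        }
        currentRow = buffer.computeIfAbsent(key, k -> new TreeMap<>());
    }

    void addColumn(String name, String value) {
        if (currentRow == null) {
            throw new IllegalStateException("call newRow() first");
        }
        currentRow.put(name, value);
        bufferedBytes += name.length() + value.length();
    }

    /** Write out whatever is still buffered. */
    void close() {
        flush();
    }

    /** Row keys per flushed "sstable", each list in sorted order. */
    List<List<String>> flushedKeys() {
        return flushed;
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushed.add(new ArrayList<>(buffer.keySet())); // TreeMap iterates in key order
        buffer.clear();
        bufferedBytes = 0;
        currentRow = null;
    }

    public static void main(String[] args) {
        ToyUnsortedWriter writer = new ToyUnsortedWriter(10);
        for (String key : new String[] {"z", "a", "m"}) {
            writer.newRow(key);
            writer.addColumn("c", "12345");
        }
        writer.close();
        System.out.println(writer.flushedKeys());
    }
}
```

A smaller buffer limit means more, smaller sstables; a larger one means fewer files but more memory held by the generating process.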

Page 24: Bulk Loading Data into Cassandra

Examining sstable output

patricia@dev:~/../data$ ls -lh mykeyspace/mycf
total 64
-rw-r--r-- 1 patricia staff  43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
-rw-r--r-- 1 patricia staff  79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db
-rw-r--r-- 1 patricia staff  16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db
-rw-r--r-- 1 patricia staff  36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db
-rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db
-rw-r--r-- 1 patricia staff  80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db
-rw-r--r-- 1 patricia staff  79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt

Page 25: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]

Page 31: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

• Run command on separate server

• Throttle command

• Parallelise processes
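
The three tips above could be combined in a small driver that runs on a machine outside the cluster. This is a sketch under assumptions: the class and method names are hypothetical, and the "-t" throttle option (a limit in Mbit/s, as I recall it) should be checked against bin/sstableloader --help for the version in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Builds throttled sstableloader commands and runs several at once (hypothetical helper). */
final class LoaderCommands {

    /** A throttleMbits of 0 or less leaves the loader unthrottled. "-t" is an assumed flag. */
    static List<String> loaderCommand(String loaderPath, String hosts,
                                      int throttleMbits, String sstableDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add(loaderPath);
        cmd.add("-d");
        cmd.add(hosts);
        if (throttleMbits > 0) {
            cmd.add("-t");
            cmd.add(Integer.toString(throttleMbits));
        }
        cmd.add(sstableDir);
        return cmd;
    }

    /** Runs each command as its own process, at most `parallelism` at a time. */
    static List<Integer> runInParallel(List<List<String>> commands, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> cmd : commands) {
                Callable<Integer> task =
                        () -> new ProcessBuilder(cmd).inheritIO().start().waitFor();
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get());
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Only prints the command; call runInParallel(...) to actually launch loaders.
        System.out.println(String.join(" ",
                loaderCommand("bin/sstableloader", "cass1,cass2,cass3", 50, "mykeyspace/mycf")));
    }
}
```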

Page 33: Bulk Loading Data into Cassandra

// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableSimpleUnsortedWriter(…);

// assume orders are in date order
for (Order order : oldOrders) {
    // one row per customer, one column per order id, empty value
    customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
    customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId),
                             ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

    // one row per order, keyed by order id to match the "orders by order id" table
    orders.newRow(ByteBufferUtil.bytes(order.orderId));
    orders.addColumn(ByteBufferUtil.bytes("customer_id"),
                     ByteBufferUtil.bytes(order.customerId), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("date"),
                     ByteBufferUtil.bytes(order.date), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("total"),
                     ByteBufferUtil.bytes(order.total), timestamp);
}

customerOrders.close();
orders.close();

Page 37: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \
    cass1,cass2,cass3

Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]

progress: [/cass1 3/3 (100)] [/cass2 0/4 (0)] [/cass3 0/0 (0)] [/cass4 0/0 (0)] [/cass5 0/0 (0)] [/cass6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
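
The output above shows data going to six nodes although only three hosts were passed with -d: the listed hosts serve as contact points, and each part of an sstable is streamed to whichever node owns that token range. The sketch below illustrates ownership lookup on a token ring in the simplest possible form. It is an assumption-laden toy (replication factor 1, hand-picked tokens, hypothetical class and method names), not the loader's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring: each node owns the tokens up to and including its own token. */
final class ToyRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String name, long token) {
        ring.put(token, name);
    }

    /** First node whose token is >= the row token, wrapping to the lowest token. */
    String ownerOf(long rowToken) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(rowToken);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    /** Groups row tokens by owning node, i.e. the "relevant part" for each node. */
    Map<String, List<Long>> groupByOwner(List<Long> rowTokens) {
        Map<String, List<Long>> byOwner = new TreeMap<>();
        for (long token : rowTokens) {
            byOwner.computeIfAbsent(ownerOf(token), k -> new ArrayList<>()).add(token);
        }
        return byOwner;
    }

    public static void main(String[] args) {
        ToyRing ring = new ToyRing();
        ring.addNode("cass1", 0L);
        ring.addNode("cass2", 100L);
        ring.addNode("cass3", 200L);
        List<Long> tokens = new ArrayList<>();
        tokens.add(50L);
        tokens.add(250L);
        tokens.add(100L);
        System.out.println(ring.groupByOwner(tokens));
    }
}
```

When the target cluster has a different topology from the one that produced the sstables, the same file can be split across several nodes in this way.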

Page 40: Bulk Loading Data into Cassandra

$ bin/sstableloader Keyspace1/ColFam1

patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
pending tasks: 30
Active compaction remaining time : n/a

Page 42: Bulk Loading Data into Cassandra

CQL: Keep schema consistent

cqlsh> CREATE KEYSPACE "test" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY);

Page 43: Bulk Loading Data into Cassandra

CQL3 Considerations

• Uses CompositeType comparator
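
Because CQL3 tables use a CompositeType comparator, column names written by a bulk loader have to be encoded as composites rather than as plain strings. As I understand the layout, each component is stored as a 2-byte length, the value bytes, and a 1-byte end-of-component marker. The sketch below builds such a name by hand; the class is hypothetical and the layout should be verified against the CompositeType documentation for the Cassandra version in use.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hand-rolled composite name encoder (illustrative, assumed layout). */
final class CompositeNameSketch {

    /** Encodes each component as: 2-byte length, UTF-8 bytes, end-of-component byte 0. */
    static ByteBuffer encode(String... components) {
        byte[][] raw = new byte[components.length][];
        int size = 0;
        for (int i = 0; i < components.length; i++) {
            raw[i] = components[i].getBytes(StandardCharsets.UTF_8);
            if (raw[i].length > 0xFFFF) {
                throw new IllegalArgumentException("component too long: " + raw[i].length);
            }
            size += 2 + raw[i].length + 1;
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        for (byte[] component : raw) {
            out.putShort((short) component.length); // length prefix
            out.put(component);                     // component value
            out.put((byte) 0);                      // end-of-component marker
        }
        out.flip(); // make the buffer readable from the start
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer name = encode("total");
        System.out.println("encoded length: " + name.remaining());
    }
}
```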

Page 44: Bulk Loading Data into Cassandra

Q&A

Patricia Gorla
@patriciagorla
Cassandra Consultant
www.thelastpickle.com

Planet Cassandra 2014