Bulk Loading Data into Cassandra

Post on 26-Jan-2015


Whether you are running load tests or migrating historical data, loading data directly into Cassandra, bypassing the system’s normal write path, can be very useful. In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review use cases for when you should and shouldn't use this method.

Transcript of Bulk Loading Data into Cassandra

Bulk-Loading Data into Cassandra

Patricia Gorla, @patriciagorla, Cassandra Consultant, www.thelastpickle.com

Planet Cassandra 2014

About Us

• Work with clients to deliver and improve Apache Cassandra services

• Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer

• Based in New Zealand & USA

Why is bulk loading useful?

• Performance tests

• Migrating historical data

• Changing topologies


• How Data is Stored

• Case Studies

- Generating Dummy Data

- Backfilling Historical Data

- Changing Topologies

• Conclusion

Cassandra Write Path

• Writes are written to both the commit log and the memtable.

• The memtable is sorted.

• Memtables are flushed out to sstables.

• Compaction helps keep the read latency low.

[Diagram: write[0], commit log, memtable, sstable[0], sstable[1], sstable[2], ... sstable[n]]
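The write path above can be sketched as a toy model in plain Java. This is only an illustration under simplifying assumptions, not Cassandra's implementation: the class name WritePathSketch, the string keys and values, the flush threshold, and the "newest sstable wins" merge rule are all made up for the example (Cassandra itself reconciles cells by timestamp).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy model of the write path: commit log + sorted memtable + immutable sstables. */
class WritePathSketch {
    private final List<String> commitLog = new ArrayList<>();          // append-only record of writes
    private NavigableMap<String, String> memtable = new TreeMap<>();   // kept sorted by key
    private final List<NavigableMap<String, String>> sstables = new ArrayList<>();
    private final int flushThreshold;                                  // hypothetical flush trigger

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Every write goes to both the commit log and the memtable. */
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    /** The sorted memtable is written out as an immutable sstable. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(Collections.unmodifiableNavigableMap(memtable));
        memtable = new TreeMap<>();
        commitLog.clear(); // flushed data no longer needs replaying
    }

    /** Compaction merges sstables so a read touches fewer files. */
    void compact() {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> table : sstables) {
            merged.putAll(table); // oldest first, so newer values overwrite older ones
        }
        sstables.clear();
        if (!merged.isEmpty()) {
            sstables.add(Collections.unmodifiableNavigableMap(merged));
        }
    }

    /** Reads check the memtable, then sstables from newest to oldest. */
    String read(String key) {
        String value = memtable.get(key);
        for (int i = sstables.size() - 1; value == null && i >= 0; i--) {
            value = sstables.get(i).get(key);
        }
        return value;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        WritePathSketch db = new WritePathSketch(2);
        db.write("b", "1");
        db.write("a", "2");
        db.write("a", "3");
        db.write("c", "4");
        System.out.println("sstables before compaction: " + db.sstableCount());
        db.compact();
        System.out.println("sstables after compaction: " + db.sstableCount());
        System.out.println("a = " + db.read("a"));
    }
}
```

Bulk loading skips the first two steps of this model: the sstables are produced offline and handed to the cluster directly.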

Sorted String Tables

    mykeyspace-mycf-jb-1-CompressionInfo.db
    mykeyspace-mycf-jb-1-Data.db
    mykeyspace-mycf-jb-1-Filter.db
    mykeyspace-mycf-jb-1-Index.db
    mykeyspace-mycf-jb-1-Statistics.db
    mykeyspace-mycf-jb-1-Summary.db
    mykeyspace-mycf-jb-1-TOC.txt

• Data.db: contains all data needed to regenerate the other components

• Index.db: index of row keys

• Summary.db: index summary from the Index.db file

• Filter.db: Bloom filter over the sstable

• TOC.txt: table of contents of all components
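The file names in the listing follow a visible pattern: keyspace, column family, format version, generation number, and component, separated by hyphens. The sketch below splits a name along those lines. It assumes exactly the layout shown in this listing; the class name SSTableFileName is hypothetical, and temporary files or other Cassandra versions may use a different layout.

```java
/** Splits an sstable file name of the form keyspace-cf-version-generation-Component. */
class SSTableFileName {
    final String keyspace;
    final String columnFamily;
    final String version;     // on-disk format version, e.g. "jb"
    final int generation;     // increases with each new sstable
    final String component;   // e.g. "Data.db", "Index.db", "TOC.txt"

    private SSTableFileName(String keyspace, String columnFamily, String version,
                            int generation, String component) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
        this.version = version;
        this.generation = generation;
        this.component = component;
    }

    /** Assumes the five-part layout from the listing; anything else is rejected. */
    static SSTableFileName parse(String fileName) {
        String[] parts = fileName.split("-");
        if (parts.length != 5) {
            throw new IllegalArgumentException("unexpected sstable file name: " + fileName);
        }
        return new SSTableFileName(parts[0], parts[1], parts[2],
                Integer.parseInt(parts[3]), parts[4]);
    }

    public static void main(String[] args) {
        SSTableFileName f = parse("mykeyspace-mycf-jb-1-Data.db");
        System.out.println(f.keyspace + " / " + f.columnFamily + " / " + f.version
                + " / " + f.generation + " / " + f.component);
    }
}
```

All files that share a keyspace, column family, version, and generation belong to the same sstable.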

Case Studies: Generating Dummy Data

Set up keyspace and column family

    create keyspace test
      with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
      and strategy_options = {replication_factor:1};

    create column family test
      with comparator = 'AsciiType'
      and default_validation_class = 'AsciiType'
      and key_validation_class = 'AsciiType';

SStableGen.java

    import java.nio.ByteBuffer;
    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.AbstractSSTableSimpleWriter;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
        directory,
        partitioner,
        keyspace,
        columnFamily,
        AsciiType.instance,    // comparator
        null,                  // subcomparator for super columns
        size_per_sstable_mb
    );

    ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
    KeyGenerator keyGen = new KeyGenerator();
    long dataSize = 0;

    while (dataSize < max_data_bytes) {
        // Assumption: KeyGenerator (a helper not shown in the talk) hands out
        // the next row key as a ByteBuffer.
        ByteBuffer key = keyGen.next();
        writer.newRow(key);
        for (int j = 0; j < num_cols; j++) {
            ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
            ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
            randomBytes.get(colValue.array());   // copy the next 20 random bytes
            colValue.position(0);
            writer.addColumn(colName, colValue, timestamp);
            // Assumption: count name and value bytes so that the loop terminates.
            dataSize += colName.remaining() + colValue.remaining();
            // Rewind when fewer than 20 bytes remain, otherwise skip ahead.
            if (randomBytes.remaining() < colValue.limit()) {
                randomBytes.position(0);
            } else {
                randomBytes.position(randomBytes.position() + colValue.limit());
            }
        }
    }
    writer.close();   // flush any buffered rows to disk

Examining sstable output

    patricia@dev:~/../data$ ls -lh mykeyspace/mycf
    total 64
    -rw-r--r-- 1 patricia staff  43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
    -rw-r--r-- 1 patricia staff  79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db
    -rw-r--r-- 1 patricia staff  16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db
    -rw-r--r-- 1 patricia staff  36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db
    -rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db
    -rw-r--r-- 1 patricia staff  80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db
    -rw-r--r-- 1 patricia staff  79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt

$ bin/sstableloader Keyspace1/ColFam1

    patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
    Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
    progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]

• Run command on separate server

• Throttle command

• Parallelise processes
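One way to parallelise the loader is to run one sstableloader invocation per output directory from a small, bounded pool of workers on a machine outside the cluster. The sketch below is a dry run under stated assumptions: the class name ParallelLoadSketch, the directory names, and the launcher callback are hypothetical, and only the "-d" option seen in the command above is used. A real launcher would start the process and return its exit code; throttling is left out because the option for it depends on the Cassandra version in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelLoadSketch {

    /** Builds the command line shown in the talk: bin/sstableloader <dir> -d <hosts>. */
    static List<String> command(String sstableDir, String hosts) {
        return Arrays.asList("bin/sstableloader", sstableDir, "-d", hosts);
    }

    /**
     * Runs one loader command per directory, at most `parallelism` at a time.
     * The launcher decides how a command is executed and returns its exit code.
     */
    static List<Integer> runAll(List<String> sstableDirs, String hosts, int parallelism,
                                Function<List<String>, Integer> launcher) {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (String dir : sstableDirs) {
                List<String> cmd = command(dir, hosts);
                Callable<Integer> task = () -> launcher.apply(cmd);
                pending.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> result : pending) {
                exitCodes.add(result.get()); // results come back in submission order
            }
            return exitCodes;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("bulk load did not complete", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical output directories, one per generator run.
        List<String> dirs = Arrays.asList("out1/mykeyspace/mycf", "out2/mykeyspace/mycf");
        // Dry-run launcher: print the command instead of starting a process.
        List<Integer> codes = runAll(dirs, "localhost", 2, cmd -> {
            System.out.println(String.join(" ", cmd));
            return 0;
        });
        System.out.println("exit codes: " + codes);
    }
}
```

Keeping the pool small is a crude form of throttling: it caps how many streams hit the cluster at once.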

Case Studies: Backfilling Historical Data

    // list of orders by user
    customerOrders = new SSTableSimpleUnsortedWriter(…);
    // orders by order id
    orders = new SSTableSimpleUnsortedWriter(…);

    // assume orders are in date order
    for (Order order : oldOrders) {
        // one row per customer, one (empty-valued) column per order id
        customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
        customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId),
                                 ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

        // one row per order, keyed by order id
        orders.newRow(ByteBufferUtil.bytes(order.orderId));
        orders.addColumn(ByteBufferUtil.bytes("customer_id"),
                         ByteBufferUtil.bytes(order.customerId), timestamp);
        orders.addColumn(ByteBufferUtil.bytes("date"),
                         ByteBufferUtil.bytes(order.date), timestamp);
        orders.addColumn(ByteBufferUtil.bytes("total"),
                         ByteBufferUtil.bytes(order.total), timestamp);
    }

    customerOrders.close();
    orders.close();

Case Studies: Changing Topologies

$ bin/sstableloader Keyspace1/ColFam1

    patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \
        cass1,cass2,cass3

    Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]

    progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]

    patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
    pending tasks: 30
    Active compaction remaining time : n/a

Conclusion

CQL: Keep schema consistent

    cqlsh> CREATE KEYSPACE "test" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

    cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY);

CQL3 Considerations

• Uses CompositeType comparator
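To see why the comparator matters, it helps to compare the bytes of a plain AsciiType column name with the same name wrapped as a single-component composite. The sketch below assumes the CompositeType layout of a 2-byte length, the component bytes, and a 1-byte end-of-component marker for each component. The class name CompositeNameSketch and the column name "name" are hypothetical; this is plain Java for illustration, not Cassandra's own encoder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class CompositeNameSketch {

    /** Encodes each component as: 2-byte length, raw bytes, 1-byte end-of-component (0). */
    static byte[] composite(String... components) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String component : components) {
            byte[] bytes = component.getBytes(StandardCharsets.UTF_8);
            if (bytes.length > 0xFFFF) {
                throw new IllegalArgumentException("component too long");
            }
            out.write((bytes.length >> 8) & 0xFF); // length, high byte
            out.write(bytes.length & 0xFF);        // length, low byte
            out.write(bytes, 0, bytes.length);
            out.write(0);                          // end-of-component marker
        }
        return out.toByteArray();
    }

    /** Hex dump helper for comparing the two encodings by eye. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] plain = "name".getBytes(StandardCharsets.US_ASCII);
        byte[] wrapped = composite("name");
        System.out.println("plain:     " + hex(plain));
        System.out.println("composite: " + hex(wrapped));
    }
}
```

Because the two layouts differ, sstables generated with a plain AsciiType comparator should not be expected to line up with a table created through CQL3; the generator's comparator needs to match the schema.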

Q&A
