Bulk-Loading Data into Cassandra
Patricia Gorla
@patriciagorla
Cassandra Consultant
www.thelastpickle.com
Planet Cassandra 2014
About Us
• Work with clients to deliver and improve Apache Cassandra services
• Apache Cassandra committer, DataStax MVP, Hector maintainer, Apache Usergrid committer
• Based in New Zealand & USA
Why is bulk loading useful?
• Performance tests
• Migrating historical data
• Changing topologies
• How Data is Stored
• Case Studies
- Generating Dummy Data
- Backfilling Historical Data
- Changing Topologies
• Conclusion
Cassandra Write Path
• Writes are written to both the commit log and the memtable.
• The memtable is sorted.
• The memtable is flushed out to sstables.
• Compaction helps keep the read latency low.
[Diagram: write[0] goes to the commit log and the memtable; the memtable is flushed to sstable[0], sstable[1], sstable[2], … sstable[n]]
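The write path above can be sketched as a toy in-memory model. This is only an illustration of the four steps, not Cassandra's implementation: the class name `WritePathSketch`, the `flushThreshold` parameter, and the string-based storage are all assumptions made for the example, and the real commit log is also recycled after a flush, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy model of the write path described above; not Cassandra's actual classes. */
class WritePathSketch {
    final List<String> commitLog = new ArrayList<>();                 // append-only, in arrival order
    TreeMap<String, String> memtable = new TreeMap<>();               // kept sorted by key
    final List<TreeMap<String, String>> sstables = new ArrayList<>(); // flushed tables, oldest first
    final int flushThreshold;                                         // hypothetical: rows per memtable

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    /** Every write goes to both the commit log and the memtable. */
    void write(String key, String value) {
        commitLog.add(key + "=" + value);
        memtable.put(key, value);
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    /** The sorted memtable becomes a new sstable and is never modified again. */
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(memtable);
        memtable = new TreeMap<>();
    }

    /** Compaction merges all sstables into one; values from newer tables win. */
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> table : sstables) {
            merged.putAll(table); // oldest first, so later tables overwrite earlier ones
        }
        sstables.clear();
        if (!merged.isEmpty()) {
            sstables.add(merged);
        }
    }

    /** A read checks the memtable, then sstables from newest to oldest. */
    String read(String key) {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {
            String value = sstables.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }
}
```

In this model a read may have to look at every sstable before compaction and at only one afterwards, which is the sense in which compaction keeps read latency low.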
Sorted String Tables
• mykeyspace-mycf-jb-1-CompressionInfo.db
• mykeyspace-mycf-jb-1-Data.db: contains all data needed to regenerate the other components
• mykeyspace-mycf-jb-1-Filter.db: Bloom filter over the sstable
• mykeyspace-mycf-jb-1-Index.db: index of row keys
• mykeyspace-mycf-jb-1-Statistics.db
• mykeyspace-mycf-jb-1-Summary.db: index summary from the Index.db file
• mykeyspace-mycf-jb-1-TOC.txt: table of contents of all components
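The component file names above follow the pattern keyspace, column family, sstable format version, generation number, component. A minimal sketch of splitting a name into those parts is shown below; the class `SSTableFileName` is hypothetical, it assumes keyspace and column family names contain no hyphens, and other Cassandra versions may lay out the name differently.

```java
/** Hypothetical helper that splits a file name such as mykeyspace-mycf-jb-1-Data.db. */
class SSTableFileName {
    final String keyspace;
    final String columnFamily;
    final String version;    // sstable format version, e.g. "jb"
    final int generation;    // counter that distinguishes sstables of one column family
    final String component;  // e.g. "Data.db", "Index.db", "TOC.txt"

    SSTableFileName(String keyspace, String columnFamily, String version,
                    int generation, String component) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
        this.version = version;
        this.generation = generation;
        this.component = component;
    }

    static SSTableFileName parse(String fileName) {
        // Limit of 5 keeps any hyphen inside the component part intact.
        String[] parts = fileName.split("-", 5);
        if (parts.length != 5) {
            throw new IllegalArgumentException("Unexpected sstable file name: " + fileName);
        }
        return new SSTableFileName(parts[0], parts[1], parts[2],
                Integer.parseInt(parts[3]), parts[4]);
    }
}
```

All files that share the same first four parts belong to one sstable.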
Case Study: Generating Dummy Data
Set up keyspace and column family
create keyspace test
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};

create column family test
  with comparator = 'AsciiType'
  and default_validation_class = 'AsciiType'
  and key_validation_class = 'AsciiType';
SStableGen.java
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter(
    directory,
    partitioner,
    keyspace,
    columnFamily,
    AsciiType.instance,    // comparator
    null,                  // subcomparator for super columns
    size_per_sstable_mb);

ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024));
KeyGenerator keyGen = new KeyGenerator();
long dataSize = 0;

while (dataSize < max_data_bytes) {
    writer.newRow(key);    // key: the next row key from keyGen (not shown on the slide)
    for (int j = 0; j < num_cols; j++) {
        ByteBuffer colName = ByteBufferUtil.bytes("col_" + j);
        ByteBuffer colValue = ByteBuffer.wrap(new byte[20]);
        randomBytes.get(colValue.array());    // copy 20 bytes out of the random pool
        colValue.position(0);
        // count the bytes written so that the loop terminates
        dataSize += colName.remaining() + colValue.remaining();
        writer.addColumn(colName, colValue, timestamp);
        // rewind the pool when it is nearly used up, otherwise skip ahead
        if (randomBytes.remaining() < colValue.limit()) {
            randomBytes.position(0);
        } else {
            randomBytes.position(randomBytes.position() + colValue.limit());
        }
    }
}
writer.close();    // flush the remaining buffered rows to disk
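The generator above fills column values by cycling through a fixed pool of random bytes instead of producing fresh randomness for every column. A self-contained sketch of that idea follows; `RandomValuePool` and its seed parameter are names invented for this example, and it rewinds the pool only when too few bytes remain, which is slightly simpler than the skipping done in the slide code.

```java
import java.nio.ByteBuffer;
import java.util.Random;

/** Hypothetical helper: hands out fixed-length values cut from one reusable random pool. */
class RandomValuePool {
    private final ByteBuffer pool;

    RandomValuePool(int poolSize, long seed) {
        byte[] bytes = new byte[poolSize];
        Random random = new Random(seed);
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) ('a' + random.nextInt(26)); // lowercase ASCII letters
        }
        pool = ByteBuffer.wrap(bytes);
    }

    /** Returns the next `length` bytes, rewinding to the start when the pool runs short. */
    ByteBuffer next(int length) {
        if (length > pool.capacity()) {
            throw new IllegalArgumentException("value longer than pool");
        }
        if (pool.remaining() < length) {
            pool.position(0);
        }
        byte[] value = new byte[length];
        pool.get(value); // advances the pool position by `length`
        return ByteBuffer.wrap(value);
    }
}
```

The values repeat once the pool wraps around, which is usually acceptable for load tests but makes the data more compressible than truly random input.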
Examining sstable output
patricia@dev:~/../data$ ls -lh mykeyspace/mycf
total 64
-rw-r--r-- 1 patricia staff 43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db
-rw-r--r-- 1 patricia staff 79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db
-rw-r--r-- 1 patricia staff 16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db
-rw-r--r-- 1 patricia staff 36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db
-rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db
-rw-r--r-- 1 patricia staff 80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db
-rw-r--r-- 1 patricia staff 79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1]
progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
• Run the command on a separate server
• Throttle the command
• Parallelise the processes
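The last two tips can be combined by launching several throttled loader processes from a small driver. The sketch below is one possible way to do that, under stated assumptions: `ParallelLoader` is a made-up class, the directory and host names are placeholders, and the `-t` flag is assumed to be the loader's throttle option in Mbit/s (check `sstableloader --help` for the version in use).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical driver that runs several throttled sstableloader processes at once. */
class ParallelLoader {

    /** Builds one loader command line; -d takes the contact hosts, -t is assumed to throttle. */
    static List<String> command(String loaderPath, String hosts, int throttleMbits, String sstableDir) {
        return Arrays.asList(loaderPath, "-d", hosts, "-t", String.valueOf(throttleMbits), sstableDir);
    }

    /** Runs the commands with at most `parallelism` processes at a time; returns their exit codes. */
    static List<Integer> runAll(List<List<String>> commands, int parallelism) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (final List<String> cmd : commands) {
                Callable<Integer> task = () -> new ProcessBuilder(cmd).inheritIO().start().waitFor();
                futures.add(pool.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> future : futures) {
                exitCodes.add(future.get()); // blocks until that process has finished
            }
            return exitCodes;
        } finally {
            pool.shutdown();
        }
    }
}
```

A caller would build one command per sstable directory with `command(...)` and pass the list to `runAll(...)`. A non-zero exit code in the result marks a load that needs to be retried.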
Case Study: Backfilling Historical Data
// list of orders by user
customerOrders = new SSTableSimpleUnsortedWriter(…);
// orders by order id
orders = new SSTableSimpleUnsortedWriter(…);

// assume orders are in date order
for (Order order : oldOrders) {
    // one row per customer, one (empty-valued) column per order id
    customerOrders.newRow(ByteBufferUtil.bytes(order.customerId));
    customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByteBufferUtil.EMPTY_BYTE_BUFFER, timestamp);

    // one row per order, keyed by order id
    orders.newRow(ByteBufferUtil.bytes(order.orderId));
    orders.addColumn(ByteBufferUtil.bytes("customer_id"), ByteBufferUtil.bytes(order.customerId), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("date"), ByteBufferUtil.bytes(order.date), timestamp);
    orders.addColumn(ByteBufferUtil.bytes("total"), ByteBufferUtil.bytes(order.total), timestamp);
}

customerOrders.close();
orders.close();
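The backfill above writes every order twice: once into a per-customer row that lists order ids, and once into a per-order row that holds the order's fields. The shape of the two resulting column families can be sketched with plain maps. Everything here (`OrderBackfillSketch`, the `Order` fields as strings) is an assumption made for illustration and does not use the Cassandra writer API.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory picture of the two column families filled by the backfill. */
class OrderBackfillSketch {

    static class Order {
        final String orderId;
        final String customerId;
        final String date;
        final String total;

        Order(String orderId, String customerId, String date, String total) {
            this.orderId = orderId;
            this.customerId = customerId;
            this.date = date;
            this.total = total;
        }
    }

    // row key = customer id; column name = order id; column value is empty
    final Map<String, TreeMap<String, String>> customerOrders = new TreeMap<>();
    // row key = order id; columns = customer_id, date, total
    final Map<String, TreeMap<String, String>> orders = new TreeMap<>();

    void backfill(List<Order> oldOrders) {
        for (Order order : oldOrders) {
            customerOrders
                .computeIfAbsent(order.customerId, k -> new TreeMap<>())
                .put(order.orderId, "");

            TreeMap<String, String> row = orders.computeIfAbsent(order.orderId, k -> new TreeMap<>());
            row.put("customer_id", order.customerId);
            row.put("date", order.date);
            row.put("total", order.total);
        }
    }
}
```

Looking up a customer's orders then takes one row read from the first map, followed by one row read per order id from the second.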
Case Study: Changing Topologies
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \
    cass1,cass2,cass3
Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2,cass3,cass4,cass5,cass6]
progress: [/cass1 3/3 (100)] [/cass2 0/4 (0)] [/cass3 0/0 (0)] [/cass4 0/0 (0)] [/cass5 0/0 (0)] [/cass6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]

patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats
pending tasks: 30
Active compaction remaining time : n/a
Conclusion
CQL: Keep schema consistent
cqlsh> CREATE KEYSPACE "test" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY);
CQL3 Considerations
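• CQL3 tables use a CompositeType comparator, so sstables written for them need a matching comparator rather than the plain AsciiType used earlier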
Q&A