Cassandra & MR
and how we used it in
Agenda
- Why do we need Cassandra & MapReduce - 3 notes about Cassandra
- 3 notes Cassandra + MapReduce
Real Time Bidding (RTB)
RTB
- How often? 10 - 30M auctions per day for mobile devices in Russia e.g. 10-30Gb of data /day
- What do we need? Be effective in showing Ads
- They call it "big data & data mining" We decided to use Cassandra&Hadoop for that
1. Cassandra tokens∆ = [T0, T4] - token range
ex. ∆ = [-2^63, +2^63]
Every time one writes (K,V) into Cassandra:
- ex. token(K) in [T2, T3]
- (K,V) will be put into node 3 (if replica 1)
2. Cassandra Load BalancingPartitioner generates tokens for your keysE.g. it creates token(K)
Cassandra offers the following partitioners:
● Murmur3Partitioner (default): Uniformly distributes data
across the cluster based on MurmurHash hash values.
● RandomPartitioner: Uniformly distributes data across the
cluster based on MD5 hash values.
● ByteOrderedPartitioner: Keeps an ordered distribution of data
lexically by key bytes
The Murmur3Partitioner is the default partitioning strategy for new Cassandra clusters and the right choice for new clusters in almost all cases.
http://www.datastax.com/docs/1.2/cluster_architecture/partitioners
3. Cassandra indexes and knows- Cassandra support common data formats E.g. byte, string, long, double
- Cassandra support secondary indexes E.g. you can select your data not bulky
- Cassandra knows how much data (records) in everytoken range
Cassandra & Map-ReduceGoogle says:
1. Cassandra is integrated with Map-Reduce http://wiki.apache.org/cassandra/HadoopSupport
2. It is outdated
3. It is used for Hadoop 1.0.3 or whatever version This means: Please install hadoop+mr cluster yourself
Cassandra & Map-Reduce (we want) 1. Cloudera Hadoop Distribution (CDH4) Cloudera manager installs your cluster in couple of clicks
2. Up to date (Cassandra 1.1.x - 1.2.x)
Solution:
A) Take Cassandra sources from http://cassandra.apache.org/download/
B) Take package org.apache.cassandra.hadoop and recompile it, having dependencies from CDH4&Cassadnra[1.x] And Jar is ready to go for your map-reduce jobs
1. Allocate your clusterDataStax says:
To configure a Cassandra cluster for Hadoop integration, overlay a Hadoop cluster over your Cassandra nodes. This involves installing a TaskTracker on each Cassandra node, and setting up a JobTracker and HDFS data node.
Why?
Because this:
works 100 times faster than this:
2. Number of map tasks
Job control parameter: InputSplitSize (default 65536)
Estimates how much data one mapper will receive
Every map task has it's own token range to read data from: [-2^63, +2^63] / number of map tasks
3. How job reads the data JobControlParameter: RangeBatchSize (default: 4096)
Bulk volume to read including your filters (primary & secondary indexes)
Cassandra does filtering job on server side
( [-2^63, +2^63] / number of map tasks )
Pros:
1. Easy to manage (Cassandra cluster & cloudera manager is
2. Easy to index
3. Supports query language & data types support
Cons:
1. Scalable extremely expensive (every node should run cassandra + hadoop)
2. Reading is very slow
3. Reading big amount is impossible
Note: Netflix reading using cassandra to manage the data.But their map-reduce jobs are reading sstable-files directly, avoiding Cassandra!
http://www.datastax.com/dev/blog/2012-in-review-performance
Conclusion
Top Related