Cassandra summit-2013
Real Time Big Data With Storm, Cassandra, and In-Memory Computing
DeWayne Filppi, @dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
• Velocity
• Volume
We’re Living in a Real Time World…
• Homeland Security
• Real Time Search
• Social
• eCommerce
• User Tracking & Engagement
• Financial Services
The Flavors of Big Data Analytics
• Counting
• Correlating
• Research
Analytics @ Twitter – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics: countries and cities, gender, age groups, device types, …
Analytics @ Twitter – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics @ Twitter – Research
• Sentiment analysis: “Obama is popular”
• Trends: “People like to tweet after watching American Idol”
• Spam patterns: How can you tell when a user spams?
It’s All about Timing
• “Real time” (under a few seconds)
• Reasonably quick (seconds to minutes)
• Batch (hours/days)
It’s All about Timing
• Event driven / stream processing: high resolution, every tweet gets counted
• Ad-hoc querying: medium resolution (aggregations)
• Long-running batch jobs (ETL, map/reduce): low resolution (trends and patterns)
This is what we’re here to discuss
VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
Simplified Event Flow
Grid – Cassandra Interface
• Hector and CQL based interface
• In-memory data must be mapped to column families
– Configurable class to column family mapping
• Must serialize individual fields
– Fixed fields can use defined types
– Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
– By default, nested fields are flattened
– Can be overridden by a custom serializer
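The object model flattening described above can be pictured with a small sketch. This is plain Python, not the XAP or Hector API: the function name `flatten`, the dotted column-naming scheme, and the sample `tweet` object are all assumptions made for illustration.

```python
from typing import Any, Dict


def flatten(obj: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into one level of column-name -> value pairs.

    The dotted naming ("user.geo.country") is an assumed convention,
    not necessarily what the real grid-to-Cassandra mapping produces.
    """
    columns: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and merge its fields into the same row.
            columns.update(flatten(value, name, sep))
        else:
            columns[name] = value
    return columns


# Hypothetical in-memory object with nested fields.
tweet = {"id": 42, "user": {"name": "alice", "geo": {"country": "NL"}}, "text": "hello"}
row = flatten(tweet)
```

Each key in `row` would correspond to one column in the target column family; a custom serializer would replace this default by storing a nested object as a single column value instead.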
Virtues and Limitations
• Complete stack, not just two tools of many
• Fast
– Microsecond latencies for in-memory operations
– Fast enough for almost anybody
• Highly available/self healing
• Elastic
BUT
• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers
Storm Background
• Popular open source, real time, in-memory, streaming computation platform
• Includes distributed runtime and intuitive API for defining distributed processing flows
• Scalable and fault tolerant
• Developed at BackType, and open sourced by Twitter
Storm Abstractions
• Streams: unbounded sequences of tuples
• Spouts: sources of streams (queues)
• Bolts: functions, filters, joins, aggregations
• Topologies
Streaming word count with Storm
• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
• Cassandra, because of its write performance, is commonly used
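To make the spout, bolt, and topology roles in a streaming word count concrete, here is a minimal sketch in plain Python. It does not use the Storm API: the generator functions `sentence_spout`, `split_bolt`, and `count_bolt`, and the `run_topology` wiring, are hypothetical stand-ins, and everything runs in one process rather than on a cluster.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Source of the stream: emits one sentence tuple at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Splits each sentence into lower-cased words."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str], counts: Counter) -> Iterator[Tuple[str, int]]:
    """Keeps a running count per word and emits (word, count) updates."""
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


def run_topology(sentences: Iterable[str]) -> Counter:
    """Wires spout -> split -> count, a rough analogue of a topology builder."""
    counts: Counter = Counter()
    for _update in count_bolt(split_bolt(sentence_spout(sentences)), counts):
        pass  # a real topology would hand each update to a persistence layer
    return counts


counts = run_topology(["the cow jumped", "the moon"])
```

The `counts` object plays the part of the state that Storm would delegate to an external provider such as Cassandra.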
Storm: Optimistic Processing
• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error
• Any kind of quasi-queue-like data source can be fashioned into a spout
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing
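The replay requirement can be sketched as follows. This is an illustration in plain Python, loosely modeled on the next-tuple / ack / fail contract of a Storm spout; the class `ReplayableSpout`, the driver `process_all`, and the deliberately flaky handler are hypothetical, and a real spout would be driven by the Storm runtime rather than by a loop like this.

```python
from collections import deque
from typing import Callable, Deque, Dict, Iterable, List, Optional, Tuple


class ReplayableSpout:
    """Queue-like source that remembers un-acked tuples so they can be replayed."""

    def __init__(self, items: Iterable[str]) -> None:
        self._queue: Deque[Tuple[int, str]] = deque(enumerate(items))
        self._pending: Dict[int, str] = {}  # emitted but not yet acked

    def next_tuple(self) -> Optional[Tuple[int, str]]:
        if not self._queue:
            return None
        msg_id, value = self._queue.popleft()
        self._pending[msg_id] = value
        return msg_id, value

    def ack(self, msg_id: int) -> None:
        # Success is the normal case: just forget the tuple.
        self._pending.pop(msg_id, None)

    def fail(self, msg_id: int) -> None:
        # Error case: put the tuple back so it is emitted again later.
        value = self._pending.pop(msg_id, None)
        if value is not None:
            self._queue.append((msg_id, value))


def process_all(spout: ReplayableSpout, handler: Callable[[str], str]) -> List[str]:
    """Drives the spout until it is empty, acking or failing each tuple."""
    processed: List[str] = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        msg_id, value = emitted
        try:
            processed.append(handler(value))
            spout.ack(msg_id)
        except Exception:
            spout.fail(msg_id)
    return processed


attempts: Dict[str, int] = {}


def flaky_upper(value: str) -> str:
    """Fails the first time it sees "b", to force one replay."""
    attempts[value] = attempts.get(value, 0) + 1
    if value == "b" and attempts[value] == 1:
        raise RuntimeError("transient failure")
    return value.upper()


result = process_all(ReplayableSpout(["a", "b", "c"]), flaky_upper)
```

Here the failed tuple is re-queued behind the remaining ones, so it is processed last; the ordering policy on replay is a design choice of this sketch, not something the slides specify.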
Fast. Want to go faster?
• Eliminate non-memory components
• Replace the disk-based queue with a reliable in-memory queue
• Replace disk-based state persistence with in-memory persistence
• Asynchronously update disk-based state (C*)
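The last point, keeping state in memory and updating the disk-based store asynchronously, is often called write-behind. The sketch below shows the idea under simplifying assumptions: a plain dict stands in for Cassandra, a single background thread does the flushing, and the class name `WriteBehindCounter` and its methods are hypothetical rather than part of XAP or Storm.

```python
import queue
import threading
from typing import Dict, Tuple


class WriteBehindCounter:
    """In-memory counters whose updates are flushed to a slower store in the background."""

    def __init__(self, store: Dict[str, int]) -> None:
        self._memory: Dict[str, int] = {}
        self._store = store  # stand-in for the disk-based store (e.g. C*)
        self._pending: "queue.Queue[Tuple[str, int]]" = queue.Queue()
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._flush_loop, daemon=True)
        worker.start()

    def increment(self, key: str) -> int:
        """Updates memory immediately and queues the new value for persistence."""
        with self._lock:
            value = self._memory.get(key, 0) + 1
            self._memory[key] = value
            # Enqueue under the lock so queued values stay in update order.
            self._pending.put((key, value))
        return value

    def _flush_loop(self) -> None:
        while True:
            key, value = self._pending.get()
            self._store[key] = value  # the slow write happens off the hot path
            self._pending.task_done()

    def drain(self) -> None:
        """Blocks until every queued update has reached the store."""
        self._pending.join()


store: Dict[str, int] = {}
counter = WriteBehindCounter(store)
for word in ["storm", "xap", "storm"]:
    counter.increment(word)
counter.drain()
```

Callers see the new count as soon as `increment` returns, while the backing store catches up later; the cost is that updates still queued are lost if the process dies, which is why the slides pair this with a reliable in-memory layer.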
Sample Architecture
References
• Try the Cloudify recipe
– Download Cloudify: http://www.cloudifysource.org/
– Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm-friendly streaming implementation on github: https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
– Part 3 coming soon.
Twitter Storm With Cassandra
Storm Overview
Challenge – Word Count
[Diagram labels: Tweets, Count?, Word:Count]
• Hottest topics
• URL mentions
• etc.
Supercharging Storm
• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability
Tweets, events, whatever…
XAP Real Time Analytics
Two Layer Approach
• Advantage: minimal “impedance mismatch” between layers
– Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in-memory cache for interactive requests
• Grid layer serves as a real-time computation fabric for CEP, and limited (to allocated memory) real-time distributed query capability
[Diagram labels: Raw Event Stream (×3), In Memory Compute Cluster, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE]
Simplified Architecture
• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
Key Concepts
Keep Things In Memory
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
– Disk: 5-10 ms
– RAM: ~0.001 ms
Take Aways
A data grid can serve different needs for big data analytics:
• Supercharge a dedicated stream processing cluster like Storm
– Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
– Roll your own
• Simplify overall architecture while enhancing scalability
– Ultra high performance/low latency
– Dynamically scalable processing and in-memory storage
– Eliminate messaging tier
– Eliminate or minimize need for RDBMS
References
• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/