Real Time Analytics with Apache Cassandra - Cassandra Day Berlin

BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Real-Time Analytics with Apache Cassandra
Cassandra Day Berlin, 11.2.2016

Guido Schmutz

• Working for Trivadis for more than 19 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis

More than 25 years of software development experience

Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://de.slideshare.net/gschmutz
Twitter: gschmutz

Our company.

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on [logo] and [logo] and Open Source technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

Trivadis Services takes over the interacting operation of your IT systems.

With over 600 specialists and IT experts in your region.

• 14 Trivadis branches and more than 600 employees
• 200 Service Level Agreements
• Over 4,000 training participants
• Research and development budget: CHF 5.0 million
• Financially self-supporting and sustainably profitable
• Experience from more than 1,900 projects per year at over 800 customers

Agenda

1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Titan:db for Graph Data

Customer Use Case and Architecture

Data Science Lab @ Armasuisse W&T

W+T flagship project, standing for innovation & tech transfer

Building capabilities in the areas of:
• Social Media Intelligence (SOCMINT)
• Big Data Technologies & Architectures

Investing in new, innovative and not yet widely proven technology:
• Batch / Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• Graph Data
• …

3 Phases: June 2013 – June 2015

SOCMINT System – Time Dimension

Major data model: Time series (TS)

TS reflect user behaviors over time

Activities correlate with events

Anomaly detection

Event detection & prediction

SOCMINT System – Social Dimension

User-user networks (social graphs);

Twitter: follower, retweet and mention graphs

Who is central in a social network?

Who has retweeted a given tweet to whom?
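The two questions above can be illustrated with a tiny in-memory example. This is only a sketch in plain Python, not the Titan:db/Gremlin approach the talk uses later; the user names, tweet ids and the choice of "number of times retweeted" as a centrality measure are all assumptions made for illustration. It answers who retweeted a tweet, but does not model the followers the retweet reached.

```python
from collections import Counter

# Hypothetical retweet edges: (retweeter, original_author, tweet_id).
RETWEETS = [
    ("bob", "alice", "t1"),
    ("carol", "alice", "t1"),
    ("dave", "bob", "t2"),
    ("carol", "bob", "t2"),
    ("erin", "alice", "t3"),
]

def in_degree_centrality(edges):
    """Count how often each author was retweeted (a simple centrality proxy)."""
    return Counter(author for _, author, _ in edges)

def retweeters_of(edges, tweet_id):
    """Return the users who retweeted the given tweet, sorted by name."""
    return sorted(user for user, _, tid in edges if tid == tweet_id)

centrality = in_degree_centrality(RETWEETS)
most_central = centrality.most_common(1)[0][0]
```

A real social graph would need a graph store and richer measures (for example betweenness or PageRank); the counter above only shows the shape of the question.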

SOCMINT System - “Lambda Architecture” for Big Data

[Figure: Lambda Architecture. Data Sources (Social, Sensor, Logfiles, Mobile, Machine) feed a Channel in Data Collection. (Analytical) Batch Data Processing: Raw Data (Reservoir), Batch compute, Batch Result Store, Computed Information. (Analytical) Real-Time Data Processing: Messaging, Stream/Event Processing, Batch compute, Real-Time Result Store. Data Access: Service, Query Engine, Result Store, used by Reports, Analytic Tools and Alerting Tools. Legend: Data in Motion, Data at Rest.]

SOCMINT System – Frameworks & Components in Use

[Figure: the Lambda Architecture layers (Data Collection, (Analytical) Batch Data Processing, (Analytical) Real-Time Data Processing, Data Access) with Social as the data source. The frameworks and components were shown as logos and did not survive extraction. Legend: Data in Motion, Data at Rest.]

Streaming Analytics Processing Pipeline

Kafka provides reliable and efficient queuing

Storm processes (rollups, counts)

Cassandra stores results at same speed

[Figure: Twitter Sensor 1, Twitter Sensor 2 and Twitter Sensor 3 feed the stages Queuing, Processing and Storing, which serve the Visualization Application.]
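The three stages (queuing, processing, storing) can be mimicked in a few lines of plain Python to show what a rollup count looks like. This is a minimal in-memory sketch, not the talk's Kafka/Storm/Cassandra code: `Queue` and `Counter` are stand-ins, and the event shape `(sensor_id, timestamp)` and the helper names are assumptions.

```python
from collections import Counter
from datetime import datetime
from queue import Queue

def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)

def run_pipeline(events):
    """Queue events, roll them up per (sensor, hour), and return the counts."""
    queue = Queue()              # stand-in for the queuing stage (Kafka in the talk)
    for event in events:
        queue.put(event)

    store = Counter()            # stand-in for the storing stage (Cassandra in the talk)
    while not queue.empty():     # stand-in for the processing stage (Storm in the talk)
        sensor_id, ts = queue.get()
        store[(sensor_id, hour_bucket(ts))] += 1
    return store

# Hypothetical tweets observed by two sensors.
events = [
    ("ABC-001", datetime(2015, 10, 14, 10, 5)),
    ("ABC-001", datetime(2015, 10, 14, 10, 45)),
    ("ABC-001", datetime(2015, 10, 14, 11, 1)),
    ("ABC-002", datetime(2015, 10, 14, 10, 30)),
]
counts = run_pipeline(events)
```

The point of the sketch is the write pattern: every event turns into one increment of a pre-aggregated cell, which is why the store has to keep up with the stream.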

Cassandra Data Modeling

Cassandra Data Modeling

• Don’t think relational!

• Denormalize, Denormalize, Denormalize …

• Rows are gigantic and sorted = one row is stored on one node

• Know your application/use cases => from query to model

• Index is not an afterthought anymore => “index” upfront

• Control physical storage structure

“Static” Tables – “Skinny Row”

CREATE TABLE skinny (
  rowkey text,
  c1 text,
  c2 text,
  c3 text,
  PRIMARY KEY (rowkey)
);

Partition Key: rowkey. The table grows up to billions of rows.

 rowkey   | c1       | c2       | c3
----------+----------+----------+----------
 rowkey-1 | value-c1 | value-c2 | value-c3
 rowkey-2 | value-c1 |          | value-c3
 rowkey-3 | value-c1 | value-c2 | value-c3

“Dynamic” Tables – “Wide Row”

CREATE TABLE wide (
  rowkey text,
  ckey text,
  c1 text,
  c2 text,
  PRIMARY KEY (rowkey, ckey)
) WITH CLUSTERING ORDER BY (ckey ASC);

Partition Key: rowkey. Clustering Key: ckey.

rowkey-1 | ckey-1:c1 = value-c1 | ckey-1:c2 = value-c2 | ckey-2:c1 = value-c1 | ckey-2:c2 = value-c2
rowkey-2
rowkey-3

Billions of rows; 1 to 2 billion columns per row.
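One way to picture the wide row is as a map from partition key to a list of cells that is kept sorted by clustering key, so that a range of clustering keys is one contiguous slice. The class below is a toy model of that idea only; `WideTable`, `upsert` and `slice` are hypothetical names, and nothing here reflects Cassandra's actual storage engine.

```python
import bisect
from collections import defaultdict

class WideTable:
    """Toy model: partition key -> cells kept sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)   # rowkey -> sorted [(ckey, values)]

    def upsert(self, rowkey, ckey, values):
        """Insert a cell, or merge the values into an existing cell."""
        cells = self._partitions[rowkey]
        keys = [k for k, _ in cells]
        i = bisect.bisect_left(keys, ckey)
        if i < len(cells) and cells[i][0] == ckey:
            cells[i] = (ckey, {**cells[i][1], **values})
        else:
            cells.insert(i, (ckey, dict(values)))

    def slice(self, rowkey, start, end):
        """Return cells with start <= ckey <= end, in clustering order."""
        cells = self._partitions.get(rowkey, [])
        keys = [k for k, _ in cells]
        lo = bisect.bisect_left(keys, start)
        hi = bisect.bisect_right(keys, end)
        return cells[lo:hi]

table = WideTable()
table.upsert("rowkey-1", "ckey-2", {"c1": "value-c1", "c2": "value-c2"})
table.upsert("rowkey-1", "ckey-1", {"c1": "value-c1"})
table.upsert("rowkey-1", "ckey-1", {"c2": "value-c2"})   # merges into ckey-1
first_cell = table.slice("rowkey-1", "ckey-1", "ckey-1")
all_cells = table.slice("rowkey-1", "ckey-0", "ckey-9")
```

Because the cells of one partition sit together and in order, a query that fixes the partition key and ranges over the clustering key reads a single sorted run.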

Cassandra for Timeseries Data

Show Timeseries: Provide list of metrics

CREATE TABLE tweet_count (
  sensor_id text,
  bucket_id text,
  key text,
  time_id timestamp,
  count counter,
  PRIMARY KEY ((sensor_id, bucket_id), key, time_id)
) WITH CLUSTERING ORDER BY (key ASC, time_id DESC);

Use of “Dynamic” Table (“Wide Row”)

bucket_id defines buckets of values
• HOUR-2015-10 = values collected hourly in one partition for one month

Partition Key        | Clustering Key : value
---------------------+------------------------------------------------------------------------------------
ABC-001:HOUR-2015-10 | dse:10:00:count = 1’550 | dse:09:00:count = 2’299 | nosql:10:00:count = 25
ABC-001:DAY-2015-10  | dse:14-OCT:count = 105’999 | dse:13-OCT:count = 120’344 | nosql:14-OCT:count = 2’532

30 d * 24 h * n keys = n * 720 columns

Open Source Time Series DBs over Cassandra:
KairosDB: https://kairosdb.github.io/
Heroic: http://spotify.github.io/heroic
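The bucketing rule and the column estimate can be written down as two small functions. This is a sketch under assumptions: the function names are hypothetical, and the bucket id format (granularity, year, month) is inferred from the examples HOUR-2015-10 and DAY-2015-10.

```python
from datetime import datetime

def bucket_id(granularity, ts):
    """Build a bucket id such as 'HOUR-2015-10' (granularity plus year and month)."""
    return f"{granularity}-{ts:%Y-%m}"

def columns_per_partition(days, slots_per_day, n_keys):
    """Upper bound on columns in one partition: days * slots per day * keys."""
    return days * slots_per_day * n_keys

ts = datetime(2015, 10, 14, 10, 0)
hour_bucket_id = bucket_id("HOUR", ts)
day_bucket_id = bucket_id("DAY", ts)
hourly_columns = columns_per_partition(30, 24, 2)   # two keys, e.g. 'dse' and 'nosql'
```

Putting the month into the partition key caps how large a single partition can grow: a new month starts a new partition, so the column count stays near n * 720 for hourly values.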

Show Timeseries: Provide list of metrics

UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';

 sensor_id | bucket_id    | key | time_id                  | count
-----------+--------------+-----+--------------------------+--------
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230

Partition Key: (sensor_id, bucket_id). Clustering Key: key, time_id.
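The behaviour of the counter update and the range query can be imitated in memory to make the access pattern concrete. This is an illustrative sketch, not driver code: `increment` and `select_since` are hypothetical helpers, the nested dictionaries stand in for partitions and cells, and the counts are made-up small numbers rather than the values in the result table.

```python
from collections import defaultdict
from datetime import datetime

# (sensor_id, bucket_id) -> {(key, time_id): count}
partitions = defaultdict(lambda: defaultdict(int))

def increment(sensor_id, bucket_id, key, time_id, delta=1):
    """Mimic UPDATE ... SET count = count + delta for one counter cell."""
    partitions[(sensor_id, bucket_id)][(key, time_id)] += delta

def select_since(sensor_id, bucket_id, key, since):
    """Mimic the SELECT: rows for one key with time_id >= since, newest first."""
    cells = partitions.get((sensor_id, bucket_id), {})
    rows = [(t, c) for (k, t), c in cells.items() if k == key and t >= since]
    return sorted(rows, reverse=True)   # time_id DESC, as in the clustering order

# Hypothetical traffic: 3 tweets at 08:00, 2 at 09:00, 1 at 10:00.
for hour, n in [(8, 3), (9, 2), (10, 1)]:
    for _ in range(n):
        increment("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, hour))

rows = select_since("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, 9))
```

Both operations name the full partition key, so each touches exactly one partition; the query then only narrows by key and time range within it.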

Titan:db & Cassandra for Graph Data

Supporting Graph Data with Titan:db and Cassandra

http://thinkaurelius.github.io/titan/

Gremlin in Action – Creating the Graph

Gremlin in Action – Graph Traversal

Gremlin in Action – Graph Traversal (II)

Summary - Know your domain

Connectedness of Data: low to high

Document Data Store

Key-Value Stores

Wide-Column Store

Graph Databases

Relational Databases

Guido Schmutz
Email: guido.schmutz@trivadis.com
+41 79 412 05 39
