Real Time Analytics with Apache Cassandra - Cassandra Day Berlin

Post on 21-Jan-2017


BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH

Real-Time Analytics with Apache Cassandra
Cassandra Day Berlin, 11.2.2016

Guido Schmutz

Guido Schmutz

• Working for Trivadis for more than 19 years
• Oracle ACE Director for Fusion Middleware and SOA
• Co-author of different books
• Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
• Member of Trivadis Architecture Board
• Technology Manager @ Trivadis

More than 25 years of software development experience

Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://de.slideshare.net/gschmutz
Twitter: gschmutz

Our company.

© Trivadis – The Company, 2/11/16

Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and and Open Source technologies in Switzerland, Germany, Austria and Denmark. We offer our services in the following strategic business fields:

Trivadis Services takes over the interacting operation of your IT systems.

OPERATION


With over 600 specialists and IT experts in your region.


14 Trivadis branches and more than 600 employees

200 Service Level Agreements

Over 4,000 training participants

Research and development budget: CHF 5.0 million

Financially self-supporting and sustainably profitable

Experience from more than 1,900 projects per year at over 800 customers

Agenda

1. Customer Use Case and Architecture
2. Cassandra Data Modeling
3. Cassandra for Timeseries Data
4. Titan:db for Graph Data

Customer Use Case and Architecture

Data Science Lab @ Armasuisse W+T

W+T flagship project, standing for innovation & tech transfer

Building capabilities in the areas of:
• Social Media Intelligence (SOCMINT)
• Big Data Technologies & Architectures

Invest in new, innovative and not yet widely proven technology:
• Batch / Real-time analysis
• NoSQL databases
• Text analysis (NLP)
• Graph Data
• …

3 Phases: June 2013 – June 2015


SOCMINT System – Time Dimension

Major data model: Time series (TS)

TS reflect user behaviors over time

Activities correlate with events

Anomaly detection

Event detection & prediction

SOCMINT System – Social Dimension

User-user networks (social graphs);

Twitter: follower, retweet and mention graphs

Who is central in a social network?

Who has retweeted a given tweet to whom?
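The two questions above can be sketched on a tiny in-memory retweet graph. This is only an illustration under stated assumptions: the `RETWEETS` sample data and the helper names are hypothetical, "central" is approximated by the number of distinct retweeters per author (a simple in-degree measure), and the lookup covers only who retweeted a tweet, not to whom. A graph database would answer the same questions with traversals.

```python
from collections import defaultdict

# Hypothetical retweet events: (retweeter, original_author, tweet_id)
RETWEETS = [
    ("bob", "alice", "t1"),
    ("carol", "alice", "t1"),
    ("dave", "alice", "t4"),
    ("dave", "bob", "t2"),
    ("carol", "bob", "t2"),
    ("alice", "carol", "t3"),
]


def in_degree_centrality(retweets):
    """Count how many distinct users retweeted each author."""
    retweeters = defaultdict(set)
    for retweeter, author, _tweet_id in retweets:
        retweeters[author].add(retweeter)
    return {author: len(users) for author, users in retweeters.items()}


def who_retweeted(retweets, tweet_id):
    """Return the sorted list of users who retweeted the given tweet."""
    return sorted(r for r, _author, t in retweets if t == tweet_id)


centrality = in_degree_centrality(RETWEETS)
# The author with the most distinct retweeters in the sample data.
most_central = max(centrality, key=centrality.get)
```

Distinct retweeters are counted rather than raw retweets so that one very active user does not dominate the measure; other centrality definitions would work equally well here.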


SOCMINT System - “Lambda Architecture” for Big Data

[Figure: Lambda Architecture. Data Sources (Social, RDBMS, Sensor, ERP, Logfiles, Mobile, Machine) feed Data Collection (Channel, Messaging), which feeds (Analytical) Batch Data Processing and (Analytical) Real-Time Data Processing; results reach Data Access (Query Engine, Service, Reports, Analytic Tools, Alerting Tools). Other labels: Raw Data (Reservoir), Batch compute, Computed Information, Batch Result Store, Stream/Event Processing, Real-Time Result Store, Result Store. Legend: Data in Motion, Data at Rest.]

SOCMINT System – Frameworks & Components in Use

[Figure: the Lambda Architecture diagram showing the frameworks and components in use. Surviving labels: Data Sources (Social), Data Collection (Channel, Messaging), (Analytical) Batch Data Processing, (Analytical) Real-Time Data Processing, Data Access (Query Engine, Service, Reports, Analytic Tools, Alerting Tools), Raw Data (Reservoir), Batch compute, Computed Information, Batch Result Store, Stream/Event Processing, Real-Time Result Store, Result Store. Legend: Data in Motion, Data at Rest.]

Streaming Analytics Processing Pipeline

Kafka provides reliable and efficient queuing

Storm processes (rollups, counts)

Cassandra stores results at same speed

[Figure: pipeline stages Queuing, Processing, Storing; inputs: Twitter Sensor 1, Twitter Sensor 2, Twitter Sensor 3; consumers: Visualization Applications]
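The kind of rollup the processing stage performs can be sketched in a few lines. This is a minimal in-memory illustration, not the Storm topology from the talk: the event tuples, `hour_bucket` and `rollup` are hypothetical names, and a real pipeline would read the events from the queue and write the counts to the store.

```python
from collections import Counter
from datetime import datetime


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def rollup(events):
    """Count events per (sensor_id, key, hour)."""
    counts = Counter()
    for sensor_id, key, ts in events:
        counts[(sensor_id, key, hour_bucket(ts))] += 1
    return counts


# Hypothetical tweet events: (sensor_id, key, timestamp)
EVENTS = [
    ("ABC-001", "dse", datetime(2015, 10, 14, 10, 5)),
    ("ABC-001", "dse", datetime(2015, 10, 14, 10, 42)),
    ("ABC-001", "nosql", datetime(2015, 10, 14, 10, 17)),
    ("ABC-001", "dse", datetime(2015, 10, 14, 11, 1)),
]

hourly = rollup(EVENTS)
```

Each entry of `hourly` corresponds to one counter that the storing stage would have to increment.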

Cassandra Data Modeling


• Don’t think relational!
• Denormalize, denormalize, denormalize …
• Rows are gigantic and sorted = one row is stored on one node
• Know your application/use cases => from query to model
• Index is not an afterthought anymore => “index” upfront
• Control physical storage structure

“Static” Tables – “Skinny Row”

CREATE TABLE skinny (
  rowkey text,
  c1 text,
  c2 text,
  c3 text,
  PRIMARY KEY (rowkey)
);

Partition Key: rowkey

Grows up to billions of rows:

 rowkey   | c1       | c2       | c3
----------+----------+----------+----------
 rowkey-1 | value-c1 | value-c2 | value-c3
 rowkey-2 | value-c1 |          | value-c3
 rowkey-3 | value-c1 | value-c2 | value-c3
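A skinny row can be pictured as a small map of columns stored under its partition key, with the key alone deciding where the row lives. The sketch below is an in-memory illustration under stated assumptions: `NODES`, `node_for` and `read` are hypothetical, and MD5 modulo the node count is only a stand-in for Cassandra's partitioner (the default one uses Murmur3 tokens).

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster


def node_for(partition_key):
    """Pick a node from a hash of the partition key (stand-in partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Each row holds a few named columns; an unset column is simply absent.
skinny = {
    "rowkey-1": {"c1": "value-c1", "c2": "value-c2", "c3": "value-c3"},
    "rowkey-2": {"c1": "value-c1", "c3": "value-c3"},
    "rowkey-3": {"c1": "value-c1", "c2": "value-c2", "c3": "value-c3"},
}


def read(rowkey):
    """Look up a whole row by its partition key, the only efficient access path."""
    return skinny.get(rowkey)


# Which node each row would be placed on under the stand-in partitioner.
placement = {rowkey: node_for(rowkey) for rowkey in skinny}
```

The table scales by adding rows, so the number of partitions grows while each partition stays small.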

“Dynamic” Tables – “Wide Row”

CREATE TABLE wide (
  rowkey text,
  ckey text,
  c1 text,
  c2 text,
  PRIMARY KEY (rowkey, ckey)
) WITH CLUSTERING ORDER BY (ckey ASC);

Partition Key: rowkey
Clustering Key: ckey

Billions of rows, each with 1 to 2 billion columns:

 rowkey-1 | ckey-1:c1 | ckey-1:c2 | ckey-2:c1 | ckey-2:c2 | ckey-3:c1 | ckey-3:c2
          | value-c1  | value-c2  | value-c1  | value-c2  | value-c1  | value-c2
 rowkey-2 | ckey-1:c1 | ckey-1:c2 | ckey-2:c1 | ckey-2:c2
          | value-c1  | value-c2  | value-c1  | value-c2
 rowkey-3 | ckey-1:c1 | ckey-1:c2 | ckey-2:c1 | ckey-2:c2 | ckey-3:c1 | ckey-3:c2
          | value-c1  | value-c2  | value-c1  | value-c2  | value-c1  | value-c2
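The wide-row layout can be imitated with one sorted structure per partition. The following is a hedged in-memory sketch, not Cassandra's storage engine: the `WideRow` class and its methods are hypothetical, and it only shows why keeping entries sorted by clustering key makes range slices within one partition cheap.

```python
import bisect


class WideRow:
    """One partition: entries kept sorted by clustering key."""

    def __init__(self):
        self._ckeys = []  # clustering keys in ascending order
        self._cells = {}  # clustering key -> {column name: value}

    def upsert(self, ckey, **columns):
        """Insert or update the columns stored under one clustering key."""
        if ckey not in self._cells:
            bisect.insort(self._ckeys, ckey)
            self._cells[ckey] = {}
        self._cells[ckey].update(columns)

    def slice(self, start, end):
        """Return (ckey, columns) pairs with start <= ckey <= end, in order."""
        lo = bisect.bisect_left(self._ckeys, start)
        hi = bisect.bisect_right(self._ckeys, end)
        return [(k, self._cells[k]) for k in self._ckeys[lo:hi]]


wide = {}  # partition key -> WideRow
row = wide.setdefault("rowkey-1", WideRow())
row.upsert("ckey-3", c1="value-c1", c2="value-c2")
row.upsert("ckey-1", c1="value-c1", c2="value-c2")
row.upsert("ckey-2", c1="value-c1")

# Entries come back in clustering order regardless of insertion order.
result = row.slice("ckey-1", "ckey-2")
```

Unlike the skinny table, this table scales in both directions: more partitions and more entries per partition.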

Cassandra for Timeseries Data


Show Timeseries: Provide list of metrics

CREATE TABLE tweet_count (
  sensor_id text,
  bucket_id text,
  key text,
  time_id timestamp,
  count counter,
  PRIMARY KEY ((sensor_id, bucket_id), key, time_id)
) WITH CLUSTERING ORDER BY (key ASC, time_id DESC);

Partition Key: (sensor_id, bucket_id)
Clustering Key: key, time_id

Use of “Static” Table

bucket_id defines buckets of values
• HOUR-2015-10 = values collected hourly in one partition for one month

 partition            | columns (key:time_id:count = value)
----------------------+-------------------------------------
 ABC-001:HOUR-2015-10 | dse:10:00:count = 1’550, dse:09:00:count = 2’299, nosql:10:00:count = 25
 ABC-001:DAY-2015-10  | dse:14-OCT:count = 105’999, dse:13-OCT:count = 120’344, nosql:14-OCT:count = 2’532

30 d * 24 h * n keys = n * 720 cols

Open source time series DBs over Cassandra:
• KairosDB: https://kairosdb.github.io/
• Heroic: http://spotify.github.io/heroic
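The bucketing scheme and the column estimate above can be reproduced with a few helper functions. This is a sketch under stated assumptions: the function names are hypothetical, and the bucket format is inferred from the slide's examples (HOUR-2015-10 and DAY-2015-10).

```python
from datetime import datetime


def hour_bucket_id(ts):
    """Bucket holding the hourly values of one month, e.g. HOUR-2015-10."""
    return "HOUR-{:04d}-{:02d}".format(ts.year, ts.month)


def day_bucket_id(ts):
    """Bucket holding the daily values of one month, e.g. DAY-2015-10."""
    return "DAY-{:04d}-{:02d}".format(ts.year, ts.month)


def hourly_columns_per_partition(days, n_keys):
    """Upper bound on counters in one hourly bucket: days * 24 h * n keys."""
    return days * 24 * n_keys


ts = datetime(2015, 10, 14, 10, 0)
hourly_bucket = hour_bucket_id(ts)
daily_bucket = day_bucket_id(ts)
columns_for_one_key = hourly_columns_per_partition(30, 1)
```

Putting the month into the partition key bounds the partition size: a new partition starts every month, so no single partition grows without limit.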

Show Timeseries: Provide list of metrics

UPDATE tweet_count SET count = count + 1
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id = '2015-10-14 10:00:00';

SELECT * FROM tweet_count
WHERE sensor_id = 'ABC-001' AND bucket_id = 'HOUR-2015-10'
AND key = 'ALL' AND time_id >= '2015-10-14 08:00:00';

 sensor_id | bucket_id    | key | time_id                  | count
-----------+--------------+-----+--------------------------+--------
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 12:00:00+0000 | 100230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 11:00:00+0000 | 102230
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 10:00:00+0000 | 105430
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 09:00:00+0000 | 203240
 ABC-001   | HOUR-2015-10 | ALL | 2015-10-14 08:00:00+0000 | 132230

Partition Key: (sensor_id, bucket_id)
Clustering Key: key, time_id
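The semantics of the two statements above can be imitated in memory to make the access pattern explicit. This is a hedged sketch, not driver code: `table`, `increment` and `select_since` are hypothetical stand-ins for the counter update and the range query, and the newest-first ordering mirrors the table's descending clustering order on time_id.

```python
from collections import defaultdict
from datetime import datetime

# (sensor_id, bucket_id) -> {(key, time_id): count}
table = defaultdict(lambda: defaultdict(int))


def increment(sensor_id, bucket_id, key, time_id, by=1):
    """Mimic the counter update: add to the counter, creating it if needed."""
    table[(sensor_id, bucket_id)][(key, time_id)] += by


def select_since(sensor_id, bucket_id, key, since):
    """Mimic the range query: one partition, one key, time_id >= since."""
    partition = table[(sensor_id, bucket_id)]
    rows = [(t, c) for (k, t), c in partition.items() if k == key and t >= since]
    return sorted(rows, reverse=True)  # newest first


for hour in (8, 9, 10, 10):
    increment("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, hour))

recent = select_since("ABC-001", "HOUR-2015-10", "ALL", datetime(2015, 10, 14, 9))
```

Both operations name the full partition key, so each touches exactly one partition.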

Titan:db & Cassandra for Graph Data


Supporting Graph Data with Titan:db and Cassandra


http://thinkaurelius.github.io/titan/

Gremlin in Action – Creating the Graph


Gremlin in Action – Graph Traversal


Gremlin in Action – Graph Traversal (II)


Summary - Know your domain

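[Figure: data store types placed along an axis "Connectedness of Data" running from low to high. Labels, in transcript order rather than axis order: Document Data Store, Key-Value Stores, Wide-Column Store, Graph Databases, Relational Databases]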

Guido Schmutz
Email: guido.schmutz@trivadis.com
+41 79 412 05 39