Introduction to Real-Time Analytics with Cassandra and Hadoop

Transcript
Page 1: Introduction to Real-Time Analytics with Cassandra and Hadoop

#strataconf + #hw2013

Real-Time Analytics with Cassandra and Hadoop

Patricia Gorla

Download code: bit.ly/1aB8Jy8 (12KB)

Page 2: Introduction to Real-Time Analytics with Cassandra and Hadoop

About Me

• Solr
• Cassandra
• DataStax MVP

Page 3: Introduction to Real-Time Analytics with Cassandra and Hadoop

Outline

• Introduction to Cassandra + 2 labs (15m break ~ 14:30)
• Analytics + 1 lab (15m break ~ 16:30)
• Extra Credit

Page 4: Introduction to Real-Time Analytics with Cassandra and Hadoop

Introduction

Page 5: Introduction to Real-Time Analytics with Cassandra and Hadoop

Getting Started

Architecture

Data Modeling

Page 6: Introduction to Real-Time Analytics with Cassandra and Hadoop

History

• Powered inbox search at Facebook
• Open-sourced in 2008

Page 7: Introduction to Real-Time Analytics with Cassandra and Hadoop

Why Cassandra?

• Linear scalability
• Availability
• Set it and forget it

Page 8: Introduction to Real-Time Analytics with Cassandra and Hadoop

Many companies use Cassandra.

...

Page 9: Introduction to Real-Time Analytics with Cassandra and Hadoop

What is Cassandra?

• Dynamo distributed cluster (no vector clocks)
• Bigtable data model
• No single point of failure (SPOF)
• Tunably consistent
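
Tunable consistency means each read or write chooses how many replicas must respond. A minimal Python sketch of the usual overlap rule, assuming R replicas answer a read, W replicas acknowledge a write, and RF is the replication factor (the function names are illustrative, not part of any Cassandra API): a read is guaranteed to touch at least one replica holding the latest write when R + W > RF.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q)                                 # 2
    print(reads_see_latest_write(q, q, rf))  # True: QUORUM reads + QUORUM writes
    print(reads_see_latest_write(1, 1, rf))  # False: ONE + ONE may miss a write
```

Reading and writing at ONE trades that guarantee for lower latency and higher availability.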

Page 10: Introduction to Real-Time Analytics with Cassandra and Hadoop

Architecture

Cluster

Keyspace

Page 11: Introduction to Real-Time Analytics with Cassandra and Hadoop

Keyspace

Column Family 1

Column Family 2

Page 12: Introduction to Real-Time Analytics with Cassandra and Hadoop

Keyspace

Column Family 1

Column Family 2

row1: {col1:val1,time,TTL; … }

Page 13: Introduction to Real-Time Analytics with Cassandra and Hadoop

Lab: introduction/1-getting-started.md

Page 14: Introduction to Real-Time Analytics with Cassandra and Hadoop

Getting Started

Architecture

Data Modeling

Page 15: Introduction to Real-Time Analytics with Cassandra and Hadoop

Writes

Commit Log -> Memtable -> SSTables

Source: datastax.com
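
The write path above can be sketched as a toy model in Python. This is an illustration under simplifying assumptions, not Cassandra's implementation: the `Node` class, the `memtable_limit` flush threshold, and clearing the whole commit log on flush are all hypothetical simplifications.

```python
class Node:
    """Toy model of one node's write path (commit log -> memtable -> SSTables)."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log, written first for durability
        self.memtable = {}     # in-memory view of recent writes
        self.sstables = []     # immutable, key-sorted tables; newest last
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs log replay

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable wins
            for k, v in table:
                if k == key:
                    return v
        return None
```

Because SSTables are never rewritten in place, an update is just another write; the read merges the memtable with the SSTables and prefers the newest value.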

Page 16: Introduction to Real-Time Analytics with Cassandra and Hadoop

Incoming write to cluster.

Page 17: Introduction to Real-Time Analytics with Cassandra and Hadoop
Page 18: Introduction to Real-Time Analytics with Cassandra and Hadoop
Page 19: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data replicated to replicas.

Page 20: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data partitioning by token ranges.

Page 21: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data partitioning by virtual nodes.
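
A rough Python sketch of both partitioning schemes, assuming an MD5-based token function and tokens derived from node names (real clusters use a configurable partitioner and assigned or random tokens; `Ring`, `token`, and `vnodes_per_node` are illustrative names). With one token per node each node owns a single range; with virtual nodes each node owns many small ranges.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a partition key to a ring position (MD5 here; an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, vnodes_per_node=1):
        # Each node owns `vnodes_per_node` tokens scattered around the ring.
        self.entries = sorted(
            (token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str, replication_factor: int = 1):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_left(self.tokens, token(key))
        owners = []
        for offset in range(len(self.entries)):
            node = self.entries[(start + offset) % len(self.entries)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == replication_factor:
                break
        return owners


if __name__ == "__main__":
    ring = Ring(["node1", "node2", "node3"], vnodes_per_node=8)
    print(ring.replicas("ylaurent", replication_factor=2))
```

The first node returned owns the key's token range; the following distinct nodes hold the additional replicas.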

Page 22: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reads

Page 23: Introduction to Real-Time Analytics with Cassandra and Hadoop

Source: fusionio.com

High-level overview of reads.

Page 24: Introduction to Real-Time Analytics with Cassandra and Hadoop

Source: datastax.com

Page 25: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 26: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 27: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 28: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 29: Introduction to Real-Time Analytics with Cassandra and Hadoop

Fault tolerance

Page 30: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 31: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 32: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 33: Introduction to Real-Time Analytics with Cassandra and Hadoop

Reading from cluster.

Page 34: Introduction to Real-Time Analytics with Cassandra and Hadoop

Deletes

• Distributed deletes are tricky
• Tombstones may not be propagated
• Don’t rely on a delete-heavy system
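
Why an unpropagated tombstone is a problem can be shown with a small Python sketch. It assumes last-write-wins cells and models each replica as a dict; `TOMBSTONE`, `purge_tombstones`, and `repair` are illustrative stand-ins for the delete marker, compaction after the grace period, and replica synchronization.

```python
TOMBSTONE = object()  # marker written in place of a value on delete


def write(replica, key, value, timestamp):
    """Last-write-wins: keep the cell with the newest timestamp."""
    current = replica.get(key)
    if current is None or timestamp > current[1]:
        replica[key] = (value, timestamp)


def delete(replica, key, timestamp):
    write(replica, key, TOMBSTONE, timestamp)


def purge_tombstones(replica):
    """Drop tombstones, as compaction does once the grace period has passed."""
    for key in [k for k, (v, _) in replica.items() if v is TOMBSTONE]:
        del replica[key]


def repair(a, b):
    """Bring two replicas in sync by exchanging their newest cells."""
    for key in set(a) | set(b):
        cells = [r[key] for r in (a, b) if key in r]
        a[key] = b[key] = max(cells, key=lambda cell: cell[1])


def read(replica, key):
    cell = replica.get(key)
    return None if cell is None or cell[0] is TOMBSTONE else cell[0]


if __name__ == "__main__":
    a, b = {}, {}
    for replica in (a, b):
        write(replica, "order", "croissant", timestamp=1)
    delete(a, "order", timestamp=2)  # replica b was down and missed the delete
    purge_tombstones(a)              # tombstone removed before b was repaired
    repair(a, b)
    print(read(a, "order"))          # croissant: the deleted value is back
```

If `repair` runs before `purge_tombstones`, the tombstone reaches both replicas and the value stays deleted.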

Page 35: Introduction to Real-Time Analytics with Cassandra and Hadoop

Getting Started

Architecture

Data Modeling

Page 36: Introduction to Real-Time Analytics with Cassandra and Hadoop

Protocols

Thrift

• Thrift, CQL
• Synchronous

Binary

• CQL
• Asynchronous

Page 37: Introduction to Real-Time Analytics with Cassandra and Hadoop

Why CQL?

• Familiar syntax
• Flexible data model over Cassandra

Page 38: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Creating a Keyspace

CREATE KEYSPACE "Patisserie" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE "Patisserie";

Page 39: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Creating a Column Family

CREATE TABLE "customers" (customer text, age int, PRIMARY KEY (customer));

CQL Schema

 customer      | age
---------------+-----
 Yves Laurent  |  77
 Coco Chanel   | 130
 Pierre Cardin |

Page 40: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Creating a Column Family

CREATE TABLE "customers" (customer text, age int, PRIMARY KEY (customer));

Physical Representation

"Yves Laurent": {"age": 77}

"Coco Chanel": {"age": 130}

"Pierre Cardin": {}

Page 41: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Columns

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item text,
  PRIMARY KEY (customer, day)
);

 customer | day | item
----------+-----+------------------
 ylaurent | M   | rivoli
 ylaurent | T   | mille feuille
 cchanel  | M   | pain au chocolat
 pcardin  | W   | mille feuille
 pcardin  | F   | croissant

Page 42: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Columns

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item text,
  PRIMARY KEY (customer, day)
);

"ylaurent": { "M:item": "rivoli", "T:item": "mille feuille" }

"cchanel": { "M:item": "pain au chocolat" }

"pcardin": { "W:item": "mille feuille", "F:item": "croissant" }
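
The mapping from CQL rows to this physical layout can be sketched in Python. This is an illustration of the naming scheme on the slide, assuming one partition key column and one clustering column; `to_physical` is a hypothetical helper, and a plain dict keeps insertion order rather than sorting cells by clustering value.

```python
def to_physical(rows, partition_key, clustering_key):
    """Group CQL rows into one wide row per partition key value."""
    physical = {}
    for row in rows:
        wide_row = physical.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue
            # Cell name = clustering value + ":" + CQL column name.
            wide_row[f"{row[clustering_key]}:{column}"] = value
    return physical


rows = [
    {"customer": "ylaurent", "day": "M", "item": "rivoli"},
    {"customer": "ylaurent", "day": "T", "item": "mille feuille"},
    {"customer": "cchanel", "day": "M", "item": "pain au chocolat"},
    {"customer": "pcardin", "day": "W", "item": "mille feuille"},
    {"customer": "pcardin", "day": "F", "item": "croissant"},
]
layout = to_physical(rows, "customer", "day")
print(layout["ylaurent"])  # {'M:item': 'rivoli', 'T:item': 'mille feuille'}
```

All purchases for one customer land in a single wide row, which is why queries by customer (optionally restricted by day) are cheap.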

Page 43: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Primary Keys

CREATE TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day, customer), hour));

 day | customer | hour | item
-----+----------+------+---------------
 M   | cchanel  |   13 | rivoli
 M   | cchanel  |   15 | mille feuille
 M   | ylaurent |    4 | rivoli
 T   | cchanel  |   17 | mille feuille
 W   | pcardin  |   20 | croissant

Page 44: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Composite Primary Keys

CREATE TABLE "daily_sales_by_item" (day text, customer text, hour timestamp, item text, PRIMARY KEY ((day, customer), hour));

"M:cchanel": { "13:item": "rivoli", "15:item": "mille feuille" }

"M:ylaurent": { "4:item": "rivoli" }

"T:cchanel": { "17:item": "mille feuille" }

"W:pcardin": { "20:item": "croissant" }

Page 45: Introduction to Real-Time Analytics with Cassandra and Hadoop

CQL: Collections

CREATE TABLE "customer_purchases" (
  customer text,
  day text,
  item list<text>,
  PRIMARY KEY (customer, day)
);

 customer | day | item
----------+-----+----------------------------------
 ylaurent | M   | ['rivoli', 'rivoli', 'javanais']
 cchanel  | M   | ['pain au chocolat']
 pcardin  | W   | ['mille feuille', 'croissant']
 pcardin  | F   | ['croissant']

Page 46: Introduction to Real-Time Analytics with Cassandra and Hadoop

Data Modeling Lab: introduction/2-data-modeling.md

Page 47: Introduction to Real-Time Analytics with Cassandra and Hadoop

Analytics

Page 48: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra and Analytics

Adapting the Data Model

MapReduce Paradigms

Page 49: Introduction to Real-Time Analytics with Cassandra and Hadoop

An Unlikely Union

• Batch processing analytics and real-time data store

• MapReduce, Hive, Pig, Sqoop, Mahout

Page 50: Introduction to Real-Time Analytics with Cassandra and Hadoop

Why Cassandra and Hadoop?

• Unified workload
• Availability
• Simpler deployment

Page 51: Introduction to Real-Time Analytics with Cassandra and Hadoop

DataStax Enterprise

Data Locality

Page 52: Introduction to Real-Time Analytics with Cassandra and Hadoop

DataStax Enterprise

Task Trackers

Job Tracker

Page 53: Introduction to Real-Time Analytics with Cassandra and Hadoop

CFS

MapReduce

Reads and writes are passed through the CassandraFS (CFS) layer.

Page 54: Introduction to Real-Time Analytics with Cassandra and Hadoop

Starting Analytics Node

$ bin/dse cassandra -t -j

# Starts task tracker and job tracker on node

Page 55: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hello, Wordcount

$ bin/dse hadoop fs -put wikipedia /

$ bin/dse hadoop jar wordcount.jar /wikipedia wc-output
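
The contents of wordcount.jar are not shown in the slides. As a rough illustration of the MapReduce paradigm it follows, here is a minimal single-process sketch in Python (the function names are illustrative; a real Hadoop job distributes the map and reduce phases across task trackers):

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reducer: sum the counts for one word."""
    return word, sum(counts)


def wordcount(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    print(wordcount(["mille feuille", "Mille croissant"]))
    # {'mille': 2, 'feuille': 1, 'croissant': 1}
```

Each mapper only needs its own slice of the input, which is what lets the framework run map tasks next to the data.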

Page 56: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra and Hadoop

Adapting the Data Model

MapReduce Paradigms

Page 57: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive

• SQL-like MapReduce abstraction
• Data types
• Efficient JOINs, GROUP BY

Page 58: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra and Hive

• Hive still has to have separate tables.
• DSE stores them in a separate keyspace.
• 1:1 mapping to Cassandra CFs
• Schemas must match or columns will be inaccessible.

Page 59: Introduction to Real-Time Analytics with Cassandra and Hadoop

CFS

MapReduce

Hive Metastore is persisted in the Cassandra layer

Hive

Page 60: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Creating a DB

hive> CREATE EXTERNAL TABLE customers (id string, name string, age int)
STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
TBLPROPERTIES (
  "cassandra.ks.name" = "Oberweis",
  "cassandra.ks.repfactor" = "2",
  "cassandra.ks.strategy" = "o.a.c.l.SimpleStrategy");

Page 61: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Multiple Data Centers

hive> CREATE EXTERNAL TABLE customers (id string, name string, age int)
STORED BY 'o.a.h.h.cassandra.CassandraStorageHandler'
TBLPROPERTIES (
  "cassandra.ks.name" = "Oberweis",
  "cassandra.ks.stratOptions" = "DC1:3, DC2:1",
  "cassandra.ks.strategy" = "o.a.c.l.NTStrategy");

Page 62: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive

• What about composite columns?
• They must be retrieved as binary data and then deserialized with a UDF.

Page 63: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Lab

• For each person, calculate how many pastries (and of what kind) they purchased.

Page 64: Introduction to Real-Time Analytics with Cassandra and Hadoop

Hive: Multiple Data Centers

hive> SELECT b.name, a.item, sum(a.amount)
FROM Oberweis.daily_purchases a
JOIN Oberweis.person b ON (a.person = b.id)
GROUP BY b.name, a.item;

Page 65: Introduction to Real-Time Analytics with Cassandra and Hadoop

Extra Credit

Page 66: Introduction to Real-Time Analytics with Cassandra and Hadoop

Real Time Considerations

• What about real time?
• Neither Hadoop nor Hive is built for real-time
• Cassandra provides you with data locality

Page 67: Introduction to Real-Time Analytics with Cassandra and Hadoop

Cassandra 2.0

• Transactions
• Triggers
• Prepared Statements

Page 68: Introduction to Real-Time Analytics with Cassandra and Hadoop

#strataconf + #hw2013

Q&A

@[email protected] on IRC (#cassandra, #python)