Cassandra Day SV 2014: Building a Flexible, Real-time Big Data Applications Platform on Apache...

Building a Flexible, Real-time Big Data Applications Platform on Cassandra with Kiji. Cassandra Day Silicon Valley, 07 April 2014. Clint Kelly, Member of Technical Staff, WibiData.

description

The talk is about the process of adding support for Cassandra in Kiji, our open-source platform for building big-data applications. I start off by describing the Kiji project, how it enables folks to build big-data applications, and (hopefully) get everyone excited about it. Then I talk about the Kiji data model, its origins in HBase (we initially built Kiji on top of HBase), how we updated it to also support Cassandra, what some of the issues were, etc. I get into some detail about our use of the Java driver and its async API, how we translate operations in Kiji into CQL statements, and some enhancements we've made to the Hadoop InputFormat and OutputFormat. I think this talk will be interesting to folks in general, and in particular will be useful for anyone who has an HBase background and is now working with Cassandra.

The Kiji Project is a modular, open-source framework that enables developers to efficiently build real-time Big Data applications. Kiji is built upon popular open-source technologies such as Cassandra, HBase, Hadoop, and Scalding, and contains components that implement functionality critical for Big Data applications, including the following:

• Support for evolvable schemas of complex data types
• Batch training of machine learning models with Hadoop
• Real-time scoring with trained models
• Integration with Hive and R
• A REST endpoint

Recently, we have updated Kiji to use Cassandra as a backing data store (previously, Kiji worked only with HBase). In this talk, we describe the process of integrating Cassandra and Kiji. Topics we cover include the following:

• The Kiji architecture and data model
• Implementing the Kiji data model in Cassandra using the Java driver and CQL3
• Integrating Cassandra with Hadoop 2.x
• Building a flexible middleware platform that supports Cassandra and HBase (including projects that use both simultaneously)
• Exposing unique features of Cassandra (e.g., variable consistency) to Kiji users

Transcript of Cassandra Day SV 2014: Building a Flexible, Real-time Big Data Applications Platform on Apache...

Page 1: Cassandra Day SV 2014: Building a Flexible, Real-time Big Data Applications Platform on Apache Cassandra with Kiji

Building a Flexible, Real-time Big Data Applications Platform on Cassandra with Kiji

Cassandra Day Silicon Valley
07 April 2014

Clint Kelly
Member of Technical Staff
WibiData

Page 2

Overview

• The Kiji Project
• The Kiji data model and KijiSchema
• Mapping Kiji to Cassandra
• Status and future work
• Try it now!


Page 3

The Kiji Project

Page 4

Want to build this...

Page 5

Have this...

Want to build this...

Page 6

Have this...

Want to build this...

Page 7

Open source components

• Batch processing
  – Extract, transform, load
  – Train machine learning models
• Scalable storage
  – Time-series data
• Serialization
  – Complex data types

[Diagram: Kiji component stack listing Hadoop, C*, HBase, Avro; KijiSchema; KijiMR, KijiREST; KijiHive, KijiScoring; KijiExpress]

Page 8

KijiSchema

• Schemas and data serialization
• Complex, atomic data types
• Schema evolution
• Table metadata

record UserLog {
  long timestamp;
  int user_id;
  string url;
  long session_id;
}

[Diagram: Kiji component stack]

Page 9

Kiji batch components

• Scala DSL ➔ describe MapReduce computations
• Machine learning library
• Hive adapter

[Diagram: Kiji component stack]

Page 10

Kiji real-time components

• REST server
• Scoring server

[Diagram: Kiji component stack]

Page 11

Kiji Summary

• Bridge between open-source technologies and real-time, big data applications
• Users are building real systems with Kiji now!
  – Personalized recommendation systems for retail
  – Energy usage and analytics reporting

Page 12

The Kiji data model and KijiSchema

Page 13

[Diagram: a single row]

Tables are composed of rows.

Page 14

[Diagram: a row split into an entity ID and data]

We call row keys “entity IDs.”

Page 15

[Diagram: a row with entity ID (0xfa, “bob”) and data]

We support composite entity IDs (with hashed and unhashed components).

Page 16

[Diagram: a row with entity ID (0xfa, “bob”) and column families info and songs]

Data in rows is organized into “column families.”

Page 17

[Diagram: row for entity ID (0xfa, “bob”) with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter]

Column families contain columns, named as “family:qualifier.”

Page 18

[Diagram: row for entity ID (0xfa, “bob”) with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter; songs:let it be is drawn as a stack of versions labeled with the timestamp 1396560123]

Individual columns can have many different timestamped versions.

Page 19

[Diagram: row for entity ID (0xfa, “bob”) with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter; songs:let it be is drawn as a stack of versions labeled with the timestamp 1396560123]

Data values can be complex records.

record SongPlay {
  long song_id;
  int user_rating;
  long session_id;
  device_type device;
}
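The logical model built up over the preceding slides (row, entity ID, family, qualifier, timestamped versions) can be sketched as a nested map. This is a hypothetical illustration in Python, not the KijiSchema API: the `Row` class, its method names, and the sample values are assumptions made for the example.

```python
from collections import defaultdict


class Row:
    """Toy model of one Kiji row: family -> qualifier -> {timestamp: value}."""

    def __init__(self, entity_id):
        self.entity_id = entity_id  # composite key, e.g. (hashed, unhashed)
        self.cells = defaultdict(lambda: defaultdict(dict))

    def put(self, family, qualifier, timestamp, value):
        """Store one version of a cell under family:qualifier."""
        self.cells[family][qualifier][timestamp] = value

    def versions(self, family, qualifier):
        """Return (timestamp, value) pairs, newest first."""
        column = self.cells[family][qualifier]
        return sorted(column.items(), reverse=True)

    def latest(self, family, qualifier):
        """Return the most recent value, or None if the column is empty."""
        history = self.versions(family, qualifier)
        return history[0][1] if history else None


# Sample data is invented; values stand in for Avro records such as SongPlay.
row = Row(entity_id=(0xFA, "bob"))
row.put("songs", "let it be", 1396560123, {"song_id": 7, "user_rating": 5})
row.put("songs", "let it be", 1396550000, {"song_id": 7, "user_rating": 4})
row.put("info", "payment", 1396500000, "AMEX1234")
```

Here `row.latest("songs", "let it be")` returns the record stored at the larger timestamp, which mirrors the "newest version first" reading of a column.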

Page 20

Locality groups

Separate logical organization of data (column families) from physical attributes (caching, compression, etc.)

[Diagram: row with an entity ID and column families info, songs_today, songs_prev_year]

Page 21

Locality groups

Separate logical organization of data (column families) from physical attributes (caching, compression, etc.)

[Diagram: row with an entity ID and column families info, songs_today, songs_prev_year, with two callouts]

• Need this data ASAP for real-time scoring.
• Use this data only for batch jobs.

Page 22

Locality groups

Always refer to columns by logical name (“family:qualifier”).

[Diagram: row with an entity ID and column families info, songs_today, songs_prev_year, grouped into two locality groups]

• “real_time” (in-memory, uncompressed, TTL = 1 day): Need this data ASAP for real-time scoring.
• “batch” (compressed, TTL = 12mo): Use this data only for batch jobs.
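The separation between logical names and physical attributes can be sketched as two lookup tables. This is a minimal illustration, not Kiji's layout format: the group names and attributes come from the slide, while the assignment of each family to a group and the helper `physical_settings` are assumptions made for the example.

```python
# Physical settings live on the locality group, not on the column family.
LOCALITY_GROUPS = {
    "real_time": {"in_memory": True, "compressed": False, "ttl": "1 day"},
    "batch": {"in_memory": False, "compressed": True, "ttl": "12mo"},
}

# Assumed assignment of families to groups (not stated on the slide).
FAMILY_TO_GROUP = {
    "info": "real_time",
    "songs_today": "real_time",
    "songs_prev_year": "batch",
}


def physical_settings(column):
    """Resolve a logical "family:qualifier" name to (group, settings)."""
    family, _, _qualifier = column.partition(":")
    group = FAMILY_TO_GROUP[family]
    return group, LOCALITY_GROUPS[group]
```

Callers only ever pass the logical name, so a family could be moved to a different group by editing `FAMILY_TO_GROUP` without touching any code that reads or writes the column.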

Page 23

KijiSchema summary

• Data model similar to Cassandra, HBase, BigTable
• Contains time dimension (not present in C*)
• Logical and physical organization separate
• Complex schemas with Avro

Page 24

Mapping Kiji to Cassandra

Page 25

Implementation notes

• Built for Cassandra 2.0.6+
• Native protocol / Java driver (no Thrift)
• Asynchronous API
• Assume users have Hadoop, ZooKeeper

Page 26

Mapping a Kiji table ➔ Cassandra

• Locality group ➔ Table
• Entity ID ➔ Primary key
  – Hashed components ➔ partition key
  – Unhashed components ➔ clustering columns
• Family, qualifier, timestamp ➔ clustering columns
• Cell values ➔ blobs

[Diagram: example row for entity ID (0xfa, “bob”) with its info and songs columns]
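The mapping rules on this slide can be turned into a small statement generator. The sketch below is hypothetical and is not Kiji's code: the function `create_table_cql` and its parameters are assumptions, and it only aims to produce a statement of the same shape as the CQL slide that follows.

```python
def create_table_cql(table, hashed, unhashed):
    """Build a CREATE TABLE statement for one locality group.

    hashed and unhashed are lists of (column_name, cql_type) pairs that
    describe the entity ID components.
    """
    fixed = [("family", "text"), ("qualifier", "text"),
             ("timestamp", "bigint"), ("value", "blob")]
    all_columns = hashed + unhashed + fixed
    columns = ",\n  ".join(f"{name} {cql_type}" for name, cql_type in all_columns)

    # Hashed components form the partition key.
    partition = ", ".join(name for name, _ in hashed)
    if len(hashed) > 1:
        partition = f"({partition})"

    # Unhashed components, then family, qualifier, timestamp, are clustering columns.
    clustering = [name for name, _ in unhashed] + ["family", "qualifier", "timestamp"]
    order = ", ".join(
        f"{name} {'DESC' if name == 'timestamp' else 'ASC'}" for name in clustering)

    return (
        f"CREATE TABLE {table} (\n  {columns},\n"
        f"  PRIMARY KEY ({partition}, {', '.join(clustering)})\n"
        f") WITH CLUSTERING ORDER BY ({order});"
    )


cql = create_table_cql(
    "users_locality_group_fast",
    hashed=[("userid", "bigint")],
    unhashed=[("username", "text")],
)
```

Sorting `timestamp` in descending order is what lets a `LIMIT 1` query return the most recent version of a cell.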

Page 27

CQL for Kiji locality group

CREATE TABLE users_locality_group_fast (
  userid bigint,
  username text,
  family text,
  qualifier text,
  timestamp bigint,
  value blob,
  PRIMARY KEY (userid, username, family, qualifier, timestamp)
) WITH CLUSTERING ORDER BY (
  username ASC, family ASC, qualifier ASC, timestamp DESC);


Page 28

cqlsh:kiji_music> SELECT * FROM kiji_table_users;

 userid | username | family | qualifier      | timestamp | value
--------+----------+--------+----------------+-----------+---------------
   0xfa | bob      | info   | email          | 139653249 | 1243970104327
   0xfa | bob      | songs  | abbey road     | 139656012 | 0981274331032
   0xfa | bob      | songs  | help           | 139625013 | 9074132704129
   0xfa | bob      | songs  | help           | 139621359 | 1923079210370
   0xfa | bob      | songs  | help           | 139625013 | 4745018223497
   0xfa | bob      | songs  | helter skelter | 139621324 | 7710423974234

Page 29

Physical organization of data on disk

[Diagram: example row for entity ID (0xfa, “bob”) with its info and songs columns, next to the flattened on-disk layout below]

0xfa:bob:info:email:t0:[email protected]
0xfa:bob:info:payment:t1:AMEX1234...
0xfa:bob:songs:let it be:t5:...
0xfa:bob:songs:let it be:t4:…
0xfa:bob:songs:let it be:t2:…
0xfa:bob:songs:help:t2:…
0xfa:bob:songs:helter skelter:t1:…

Efficient queries = continuous scans!
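Why the clustering order makes some reads cheap can be seen by sorting a handful of cells the way the table definition orders them. This is an in-memory sketch, not Cassandra's storage engine: the tuples reuse the slide's sample names, and the small integers stand in for the t0..t5 timestamps.

```python
cells = [
    # (userid, username, family, qualifier, timestamp)
    (0xFA, "bob", "songs", "help", 2),
    (0xFA, "bob", "info", "payment", 1),
    (0xFA, "bob", "songs", "let it be", 5),
    (0xFA, "bob", "info", "email", 0),
    (0xFA, "bob", "songs", "let it be", 2),
    (0xFA, "bob", "songs", "helter skelter", 1),
    (0xFA, "bob", "songs", "let it be", 4),
]


def clustering_key(cell):
    """Sort key for username ASC, family ASC, qualifier ASC, timestamp DESC."""
    _userid, username, family, qualifier, timestamp = cell
    return (username, family, qualifier, -timestamp)


on_disk = sorted(cells, key=clustering_key)

# Every cell of one family lands in a single contiguous slice of the partition.
positions = [i for i, cell in enumerate(on_disk) if cell[2] == "songs"]
is_contiguous = positions == list(range(positions[0], positions[-1] + 1))
```

A query that fixes a prefix of the clustering columns (for example the family, or the family and qualifier) therefore reads one unbroken range, while a query that skips a column has to touch cells scattered across the partition.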

Page 30

Kiji queries ➔ CQL queries

All data in “info” column family for “bob” ➔

SELECT qualifier, value FROM music
  WHERE userid=0xfa AND username='bob' AND family='info';

[Diagram: example row for entity ID (0xfa, “bob”) with its info and songs columns]

Page 31

Kiji queries ➔ CQL queries

Data in “info:email” and last play of “help” for “bob” ➔

SELECT value FROM music
  WHERE userid=0xfa AND username='bob' AND family='info' AND qualifier='email';

SELECT value FROM music
  WHERE userid=0xfa AND username='bob' AND family='songs' AND qualifier='help'
  LIMIT 1;

[Diagram: example row for entity ID (0xfa, “bob”) with its info and songs columns]
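The translation on this slide, one Kiji request for several columns becoming one CQL statement per column, can be sketched as a small builder. The function `column_queries` and its parameters are hypothetical; the `?` markers are bind placeholders as used in prepared statements, and unlike the slide this sketch applies a `LIMIT` to every column.

```python
def column_queries(table, userid, username, columns, max_versions=1):
    """Return one (cql, params) pair per requested "family:qualifier" column."""
    statements = []
    for column in columns:
        family, qualifier = column.split(":", 1)
        cql = (
            f"SELECT value FROM {table} "
            "WHERE userid=? AND username=? AND family=? AND qualifier=? "
            f"LIMIT {int(max_versions)}"
        )
        statements.append((cql, (userid, username, family, qualifier)))
    return statements


statements = column_queries("music", 0xFA, "bob", ["info:email", "songs:help"])
```

Because timestamps are clustered in descending order, `LIMIT 1` selects the newest version of each column.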

Page 32

Kiji queries ➔ CQL queries

All songs played by “bob” on April 2nd ➔

SELECT qualifier, value FROM music
  WHERE userid=0xfa AND username='bob' AND family='songs'
  AND timestamp >= 1396396800 AND timestamp <= 1396483200
  ALLOW FILTERING; 😱😱

[Diagram: example row for entity ID (0xfa, “bob”) with its info and songs columns]

Page 33

Kiji queries ➔ CQL queries

[Diagram: example row for entity ID (0xfa, “bob”) with its info and songs columns]

Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)
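The rule stated in this error message can be modeled in a few lines: a clustering column may be restricted only if every clustering column before it is restricted by equality. The checker below is a simplified sketch of that rule, not Cassandra's actual validation logic; the function name and the dictionary representation of a WHERE clause are assumptions.

```python
CLUSTERING_COLUMNS = ["username", "family", "qualifier", "timestamp"]


def check_restrictions(restrictions):
    """Validate a WHERE clause against the clustering-column rule.

    restrictions maps a column name to its operator ("=", ">=", "<=", ...).
    Returns None if the clause is acceptable, else the offending column.
    """
    blocked = False  # True once a column is unrestricted or non-EQ
    for column in CLUSTERING_COLUMNS:
        op = restrictions.get(column)
        if op is None:
            blocked = True
        elif blocked:
            return column
        elif op != "=":
            blocked = True
    return None


# The "songs played on April 2nd" query skips qualifier, so timestamp is rejected.
rejected = check_restrictions({"username": "=", "family": "=", "timestamp": ">="})
```

Adding `qualifier = ...` to the same clause makes the timestamp range legal, which is why the query has to be split per qualifier or filtered on the client.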

Page 34

Queries that do not map well to CQL

• Break up into multiple CQL queries
  – Hooray for Session#executeAsync!
• Filter on the client
  – Potentially very expensive, but functional
  – Provide warning to user
• Educate users about table layout
  – Layout in previous example is terrible for that query
• Most issues related to “time” dimension
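The first strategy on this slide, splitting one awkward request into several well-formed queries and issuing them concurrently, can be sketched with a thread pool. This is an in-memory stand-in and does not use the Java driver: `query_one_qualifier` plays the role of a single CQL query submitted with `Session#executeAsync`, the sample cells are invented, and the sketch assumes the list of qualifiers is already known.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the "songs" family of one user: (qualifier, timestamp).
SONGS = [
    ("help", 1396400000),
    ("help", 1396300000),
    ("let it be", 1396450000),
    ("helter skelter", 1396500000),
]


def query_one_qualifier(qualifier, start, end):
    """Stand-in for one CQL query: qualifier=? AND timestamp within [start, end]."""
    return [(q, ts) for q, ts in SONGS if q == qualifier and start <= ts <= end]


def plays_between(qualifiers, start, end):
    """Fan out one query per qualifier, then merge the results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(query_one_qualifier, q, start, end)
                   for q in qualifiers]
        return sorted(cell for future in futures for cell in future.result())


april_2 = plays_between(["help", "let it be", "helter skelter"],
                        1396396800, 1396483200)
```

Each sub-query restricts `qualifier` by equality before ranging over `timestamp`, so none of them needs `ALLOW FILTERING`; the cost moves to the number of round trips.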

Page 35

MapReduce

• Wrote new InputFormat, OutputFormat
• Hadoop 2.x
• Multiple C* queries per RecordReader
• Does not use Thrift

Page 36

Project status and next steps

Page 37

Initial release in ~ 2 weeks

• Cassandra as part of the Bento Box
• Cassandra working in KijiSchema, KijiMR

Page 38

Support in the coming months

• Cassandra integration with KijiREST, KijiScoring, KijiExpress, etc.
• Expose Cassandra-specific features to users
  – Variable consistency levels
  – Load-balancing policies
  – Diagnostics (e.g., route tracing)
• Kiji support in CQLSH
  – Decode Avro values

Page 39

Thanks to Cassandra community

• Great help on mailing lists for users, dev, java driver
• Webinars, meetups, C* Summit all available online
• Free training from DataStax
• Very easy to get up-to-speed

Page 40

Try it now -- Kiji Bento Box

• Latest compatible versions of all components
• Hadoop, ZooKeeper, HBase
• Cassandra in ~2 weeks

www.kiji.org/getstarted


Page 41

KijiSchema

• Schemas and data serialization
• Complex data types (e.g., nested maps)
• Schema evolution
• Metadata
• Composite row keys
• Transparent paging
• Data-definition language, REPL

[Diagram: Kiji component stack]

Page 42

Schema support

Support for complex schemas with Avro

record UserLog {
  long timestamp;
  int user_id;
  string url;
}

KijiSchema allows schema versioning
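Schema versioning can be illustrated with the two `UserLog` shapes in this deck: the record above, and the one on page 8 that also carries `session_id`. The sketch below mimics the Avro resolution rule that a field missing from the written data is filled from the reader schema's default. It is a plain-Python model, not the Avro library or KijiSchema; which version came first and the default value for `session_id` are assumptions.

```python
# Reader schema: field name -> default used when the stored record lacks it.
# NO_DEFAULT marks fields that must be present in the stored record.
NO_DEFAULT = object()

USER_LOG_V2 = {
    "timestamp": NO_DEFAULT,
    "user_id": NO_DEFAULT,
    "url": NO_DEFAULT,
    "session_id": -1,  # assumed default; not given on the slides
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader schema, filling defaults."""
    resolved = {}
    for field, default in reader_schema.items():
        if field in record:
            resolved[field] = record[field]
        elif default is NO_DEFAULT:
            raise ValueError(f"missing required field: {field}")
        else:
            resolved[field] = default
    return resolved


# A record written with the three-field schema, read with the four-field one.
old_record = {"timestamp": 1396560123, "user_id": 42, "url": "/songs/help"}
new_view = resolve(old_record, USER_LOG_V2)
```

Old cells stay readable after the schema gains a field, as long as the new field declares a default.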

Page 43

Column name translation

• “family:qualifier” -> “A:B”
• Saves disk space
• Improves performance
• User-facing tools translate names
• Possible to turn this off
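Column name translation can be pictured as a two-way dictionary that hands out short identifiers in first-seen order. The class below is a hypothetical sketch, not Kiji's translator: the encoding scheme, the single shared namespace for families and qualifiers, and all names are assumptions chosen so that the slide's “A:B” example falls out.

```python
import string


class NameTranslator:
    """Assign short identifiers (A, B, C, ...) to long names, first seen first."""

    def __init__(self):
        self._short = {}
        self._long = {}

    def shorten(self, name):
        """Return the short identifier for name, allocating one if needed."""
        if name not in self._short:
            short = self._encode(len(self._short))
            self._short[name] = short
            self._long[short] = name
        return self._short[name]

    def expand(self, short):
        """Return the original long name for a short identifier."""
        return self._long[short]

    @staticmethod
    def _encode(index):
        """0 -> A, 25 -> Z, 26 -> BA, ... (base 26, illustrative only)."""
        letters = string.ascii_uppercase
        digits = letters[index % 26]
        while index >= 26:
            index //= 26
            digits = letters[index % 26] + digits
        return digits


translator = NameTranslator()
short_column = ":".join(
    translator.shorten(part) for part in "info:email".split(":", 1))
```

User-facing tools would call `expand` on the way out, so the stored cells carry the short names while users continue to see “family:qualifier”.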

Page 44

Kiji queries ➔ CQL queries

All data in family “songs” for user “bob” ➔

SELECT qualifier, value FROM music
  WHERE userid=0xfa AND username='bob' AND family='songs';

[Diagram: example row for entity ID (0xfa, “bob”) with its info and songs columns]