Kite SDK introduction for Portland Big Data

Kite SDK: It’s for developers Ryan Blue, Software Engineer

Kite SDK is a set of tools for building big data applications on Hadoop.

Transcript of Kite SDK introduction for Portland Big Data

Page 1: Kite SDK introduction for Portland Big Data

Kite SDK: It’s for developers
Ryan Blue, Software Engineer

Page 2: Kite SDK introduction for Portland Big Data

Resources

©2014 Cloudera, Inc. All rights reserved.

• Kite guide
  • http://tiny.cloudera.com/KiteGuide

• Dataset overview and intro
  • http://tiny.cloudera.com/Datasets

• Command-line tutorial
  • http://tiny.cloudera.com/KiteCLI

• Kite repository and examples
  • https://github.com/kite-sdk/kite
  • https://github.com/kite-sdk/kite-examples

Page 3: Kite SDK introduction for Portland Big Data

Agenda

• Kite background
• Kite data

Page 4: Kite SDK introduction for Portland Big Data

What problem does Kite solve?

• Accessibility for getting started
  • Easy to get started, without being an expert
  • Use before understanding

• Save time for experienced developers
  • Off-the-shelf tools for common tasks
  • Quickly iterate and test configurations

Page 5: Kite SDK introduction for Portland Big Data

Kite Datasets: Motivation

• Focus on using data, not managing files
• Developers shouldn’t have to maintain data files
• Use through configuration, not code
• Need consistency across the platform

Page 6: Kite SDK introduction for Portland Big Data

Kite Datasets: Motivation

[Diagram: an Application (user code) sits on a Database (provided); the Data files underneath are maintained by the database]

Page 7: Kite SDK introduction for Portland Big Data

Kite Datasets: Motivation

[Diagram: two Applications; one sits on a Database over its Data files, the other works directly against Data files and HBase through user code]

Page 8: Kite SDK introduction for Portland Big Data

Kite Datasets: Motivation

[Diagram: three Applications; one sits on a Database over its Data files, one works directly against Data files and HBase, and one sits on Kite Data, where the Data files and HBase storage are maintained by Kite]

Page 9: Kite SDK introduction for Portland Big Data

Kite Datasets: Goals

• Think in terms of data: datasets, views, records
• Describe your data and layout, and Kite does the right thing
• Should work consistently across the platform
• Reliable

Page 10: Kite SDK introduction for Portland Big Data

Kite Datasets: Compatibility

Project      HDFS (Avro)   HDFS (Parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format

Page 11: Kite SDK introduction for Portland Big Data

Current compatibility (0.15.0)

Project      HDFS (Avro)   HDFS (Parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format

Page 12: Kite SDK introduction for Portland Big Data

Agenda

• Kite background
• Kite data

[Diagram: an Application sits on Kite Data; the Data files and HBase storage underneath are maintained by Kite]

Page 13: Kite SDK introduction for Portland Big Data

Datasets

• A collection of records or entities
  • Like a Hive or relational table
  • Generic, reflected, or generated objects

• Identified by URI
  • dataset:hdfs:/data/ratings
  • dataset:hive:/data/ratings
  • dataset:hbase:zk1/ratings

ratings = Datasets.load("dataset:hive:/data/ratings")
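
The three URIs above share one shape: a "dataset:" prefix wrapped around a storage-specific URI. The sketch below only illustrates that nesting with the standard java.net.URI class; the class and method names are hypothetical, and it is not how Kite itself resolves URIs.

```java
import java.net.URI;

/** Illustrative only: splits a "dataset:" URI into its storage scheme and location. */
final class DatasetUriSketch {

    /** Returns the inner storage URI, e.g. "hive:/data/ratings". */
    static URI storageUri(String datasetUri) {
        URI outer = URI.create(datasetUri);
        if (!"dataset".equals(outer.getScheme())) {
            throw new IllegalArgumentException("Not a dataset URI: " + datasetUri);
        }
        // The scheme-specific part of the outer URI is itself a URI.
        return URI.create(outer.getSchemeSpecificPart());
    }

    public static void main(String[] args) {
        String[] uris = {
            "dataset:hdfs:/data/ratings",
            "dataset:hive:/data/ratings",
            "dataset:hbase:zk1/ratings"
        };
        for (String uri : uris) {
            URI storage = storageUri(uri);
            // Prints the storage scheme followed by the location within that storage.
            System.out.println(storage.getScheme() + " -> " + storage.getSchemeSpecificPart());
        }
    }
}
```

Only the storage scheme changes between the HDFS, Hive, and HBase forms, which is why application code that loads a dataset by URI does not need to change when the storage does.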

Page 14: Kite SDK introduction for Portland Big Data

Dataset configuration, JSON

• Schema (Avro)
  • Record fields, like a table definition

Page 15: Kite SDK introduction for Portland Big Data

Dataset configuration, JSON

• Schema (Avro)
  • Record fields, like a table definition

• Partition strategy
  • Layout or key definition from record fields

Page 16: Kite SDK introduction for Portland Big Data

Configuring partitioning

• Partition strategy

[
  { "source" : "timestamp", "type" : "year" },
  { "source" : "timestamp", "type" : "month" },
  { "source" : "timestamp", "type" : "day" }
]

datasets/
└── ratings/
    ├── year=1997/
    │   ├── month=09/
    │   │   ├── day=20/
    │   │   ├── ...
    │   │   └── day=30/
    │   ├── month=10/
    │   │   ├── day=01/
    │   │   ├── ...
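
To make the link between the partition strategy and the directory layout concrete, here is a minimal sketch that derives a year/month/day path from a record's timestamp. It assumes the timestamp is epoch milliseconds interpreted in UTC; the class and method names are hypothetical, and this is not Kite's own partitioner code.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustrative only: maps a timestamp to a year=/month=/day= partition path. */
final class PartitionPathSketch {

    /** Assumes timestampMillis is epoch milliseconds and partitions are computed in UTC. */
    static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        // Month and day are zero-padded to two digits, as in the layout shown on the slide.
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        long ts = Instant.parse("1997-09-20T12:00:00Z").toEpochMilli();
        System.out.println("datasets/ratings/" + partitionPath(ts));
    }
}
```

Each entry in the strategy contributes one directory level, in the order the entries are listed.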

Page 17: Kite SDK introduction for Portland Big Data

Configuring key building

• Partition strategy for HBase

[
  { "source" : "email", "type" : "hash", "buckets" : 32 },
  { "source" : "email", "type" : "identity" }
]

(22, "[email protected]")

\x80\x00\x00\[email protected]\x00\x00
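
The sketch below shows one way a (bucket, email) pair could become a sortable row key. Both the hash function and the byte layout are assumptions: a sign-flipped big-endian int followed by the UTF-8 string and two zero bytes, chosen only because that is consistent with the leading \x80\x00\x00 and trailing \x00\x00 bytes shown above. Kite's actual encoding may differ, the class and method names are hypothetical, and the address is a placeholder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Illustrative only: builds a composite row key from a hash bucket and a string. */
final class RowKeySketch {

    /** Assumed hash partitioner: non-negative hashCode modulo the bucket count. */
    static int bucket(String value, int buckets) {
        return (value.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    /** Assumed layout: sign-flipped big-endian int, UTF-8 string, two zero bytes. */
    static byte[] encode(int bucket, String value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Flipping the sign bit makes signed ints sort correctly as unsigned bytes.
        byte[] intBytes = ByteBuffer.allocate(4).putInt(bucket ^ 0x80000000).array();
        out.write(intBytes, 0, intBytes.length);
        byte[] text = value.getBytes(StandardCharsets.UTF_8);
        out.write(text, 0, text.length);
        out.write(0);
        out.write(0);
        return out.toByteArray();
    }

    /** Renders printable ASCII as-is and every other byte as \xNN. */
    static String toEscaped(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v >= 0x20 && v < 0x7F) {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String email = "user@example.com"; // placeholder address
        System.out.println("bucket = " + bucket(email, 32));
        // Uses bucket 22 from the slide rather than the computed bucket.
        System.out.println(toEscaped(encode(22, email)));
    }
}
```

Putting the hash bucket first spreads writes across regions, while the identity part keeps each key unique and recoverable.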

Page 18: Kite SDK introduction for Portland Big Data

Dataset configuration, JSON

• Schema (Avro)
  • Record fields, like a table definition

• Partition strategy
  • Layout or key definition from record fields

• Column mapping (HBase)
  • Where to store record fields

Page 19: Kite SDK introduction for Portland Big Data

Mapping example

{ "type" : "record", "name" : "User", "fields" : [
    { "name" : "email", "type" : "string" },
    ...
] }

family               name                counts   prefs
row key              last        first   visits   flash
[email protected]    Lightyear   Buzz    315      true

[
  { "source": "email", "type": "key" },
  ...
]

Page 20: Kite SDK introduction for Portland Big Data

Mapping example

{ "type" : "record", "name" : "User", "fields" : [
    { "name" : "lastName", "type" : "string" },
    ...
] }

family               name                counts   prefs
row key              last        first   visits   flash
[email protected]    Lightyear   Buzz    315      true

[
  { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" },
  ...
]
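
The two mapping entries shown on these slides can be read as instructions for turning a record into an HBase row. The following sketch applies just those two entries (email as the row key, lastName stored in name:last) to an in-memory record. The classes are hypothetical stand-ins rather than Kite or HBase types, the address is a placeholder, and the mappings elided on the slides are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: applies a column mapping to a record held in a Map. */
final class ColumnMappingSketch {

    /** One mapping entry: a record field and where it is stored. */
    static final class FieldMapping {
        final String source;
        final String type;      // "key" or "column"
        final String family;    // null for key mappings
        final String qualifier; // null for key mappings

        FieldMapping(String source, String type, String family, String qualifier) {
            this.source = source;
            this.type = type;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** The row key plus cells addressed as "family:qualifier". */
    static final class Row {
        String rowKey;
        final Map<String, Object> cells = new LinkedHashMap<>();
    }

    static Row toRow(Map<String, Object> record, List<FieldMapping> mappings) {
        Row row = new Row();
        for (FieldMapping m : mappings) {
            Object value = record.get(m.source);
            if ("key".equals(m.type)) {
                row.rowKey = String.valueOf(value);
            } else if ("column".equals(m.type)) {
                row.cells.put(m.family + ":" + m.qualifier, value);
            }
        }
        return row;
    }

    /** Only the two mapping entries that appear on the slides. */
    static List<FieldMapping> slideMappings() {
        List<FieldMapping> mappings = new ArrayList<>();
        mappings.add(new FieldMapping("email", "key", null, null));
        mappings.add(new FieldMapping("lastName", "column", "name", "last"));
        return mappings;
    }

    public static void main(String[] args) {
        Map<String, Object> user = new LinkedHashMap<>();
        user.put("email", "user@example.com"); // placeholder address
        user.put("lastName", "Lightyear");
        Row row = toRow(user, slideMappings());
        System.out.println(row.rowKey + " " + row.cells);
    }
}
```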

Page 21: Kite SDK introduction for Portland Big Data

Command-line demo?

1. Describe your data

   dataset obj-schema org.movielens.Rating --jar app.jar \
     --output rating.avsc

2. Describe your layout

   dataset partition-config ts:year ts:month ts:day \
     --schema rating.avsc --output ymd.json

3. Create a dataset

   dataset create ratings --schema rating.avsc \
     --partition-by ymd.json

Page 22: Kite SDK introduction for Portland Big Data

Command-line tool

• Executable jar download
  • Inspects the environment

• Must be used on-cluster
  • Classpath for HBase, Hive, etc.

• Debugging: debug=true ./dataset -v <command>

• Requires MAPRED_HOME variable on CDH5

Page 23: Kite SDK introduction for Portland Big Data

Resources

• Kite guide
  • http://tiny.cloudera.com/KiteGuide

• Dataset overview and intro
  • http://tiny.cloudera.com/Datasets

• Command-line tutorial
  • http://tiny.cloudera.com/KiteCLI

• Kite repository and examples
  • https://github.com/kite-sdk/kite
  • https://github.com/kite-sdk/kite-examples

Page 24: Kite SDK introduction for Portland Big Data

Questions

Ryan Blue: [email protected]
Mailing list: [email protected]

Page 25: Kite SDK introduction for Portland Big Data

Maven parent POM

• Automatic Kite and Hadoop dependencies
• Inherit from kite-app-parent-cdh4
• CDH4 only, CDH5 support in 0.16.0

<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh4</artifactId>
  <version>0.15.0</version>
</parent>

Page 26: Kite SDK introduction for Portland Big Data

Maven Plugin

• Maven plugin manages datasets for an application
• Configured by app-parent POM
• Handles create, update, etc. in Maven goals

Page 27: Kite SDK introduction for Portland Big Data

MapReduce

• DatasetKeyInputFormat
• DatasetKeyOutputFormat
• Values are always null

View eventsBeforeToday = Datasets
    .load("dataset:hive:/data/events")
    .toBefore("timestamp", startOfToday());

DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);
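
The slide leaves startOfToday() undefined. A minimal sketch of such a helper, and of which records an exclusive "before" bound selects, is below. It assumes the timestamp field holds epoch milliseconds and that the day boundary is taken in the clock's zone; the class is hypothetical and filters a plain list rather than a Kite view.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative only: a startOfToday() helper and the records an exclusive bound keeps. */
final class StartOfTodaySketch {

    /** Midnight at the start of the current day, in epoch milliseconds, for the given clock. */
    static long startOfToday(Clock clock) {
        return LocalDate.now(clock)
            .atStartOfDay(clock.getZone())
            .toInstant()
            .toEpochMilli();
    }

    /** Keeps timestamps strictly before the bound, mirroring an exclusive upper limit. */
    static List<Long> before(List<Long> timestamps, long upperBound) {
        return timestamps.stream()
            .filter(ts -> ts < upperBound)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A fixed clock keeps the example deterministic; the date is arbitrary.
        Clock clock = Clock.fixed(Instant.parse("2014-06-15T10:30:00Z"), ZoneOffset.UTC);
        long start = startOfToday(clock);
        List<Long> timestamps = Arrays.asList(start - 1, start, start + 1);
        System.out.println(before(timestamps, start));
    }
}
```

Passing a Clock instead of reading the system time makes the day boundary easy to test.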

Page 28: Kite SDK introduction for Portland Big Data

Crunch

• CrunchDatasets.asSource
• CrunchDatasets.asTarget

PCollection<Event> events = getPipeline().read(
    CrunchDatasets.asSource(eventsBeforeToday));

• Handle-existing support in 0.16.0
• Configure dependencies with Kite parent POM

Page 29: Kite SDK introduction for Portland Big Data

DatasetSink

• Write to HDFS Avro and HBase
  • http://tiny.cloudera.com/DatasetSink

• Proxy user support
• Automatic partitioning

agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
agent.sinks.name.kite.dataset.name = events
agent.sinks.name.auth.proxyUser = cloudera