Kite SDK introduction for Portland Big Data

Post on 06-May-2015

Kite SDK is a set of tools for building big data applications on Hadoop.

Kite SDK: It’s for developers
Ryan Blue, Software Engineer

Resources

©2014 Cloudera, Inc. All rights reserved.

• Kite guide
  • http://tiny.cloudera.com/KiteGuide
• Dataset overview and intro
  • http://tiny.cloudera.com/Datasets
• Command-line tutorial
  • http://tiny.cloudera.com/KiteCLI
• Kite repository and examples
  • https://github.com/kite-sdk/kite
  • https://github.com/kite-sdk/kite-examples

Agenda

• Kite background
• Kite data

What problem does Kite solve?

• Accessibility for getting started
  • Easy to get started, without being an expert
  • Use before understanding
• Save time for experienced developers
  • Off-the-shelf tools for common tasks
  • Quickly iterate and test configurations

Kite Datasets: Motivation

• Focus on using data, not managing files
• Developers shouldn’t have to maintain data files
• Use through configuration, not code
• Need consistency across the platform

Kite Datasets: Motivation

[Diagram: an Application (user code) runs on a Database (provided); the Data files are maintained by the database.]

Kite Datasets: Motivation

[Diagram: two Applications. On one side, a Database over Data files; on the other, Data files and HBase handled by user code.]

Kite Datasets: Motivation

[Diagram: three Applications. Next to the Database over Data files and the direct use of Data files and HBase, Kite Data sits between an Application and its Data files and HBase, which are maintained by Kite.]

Kite Datasets: Goals

• Think in terms of data: datasets, views, records
• Describe the data and its layout, and Kite does the right thing
• Should work consistently across the platform
• Reliable

Kite Datasets: Compatibility

Project      HDFS (avro)   HDFS (parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format

Current compatibility (0.15.0)

Project      HDFS (avro)   HDFS (parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format

Agenda

• Kite background
• Kite data

[Diagram: an Application on top of Kite Data, which maintains the Data files and HBase.]

Datasets

• A collection of records or entities
  • Like a Hive or relational table
  • Generic, reflected, or generated objects
• Identified by URI
  • dataset:hdfs:/data/ratings
  • dataset:hive:/data/ratings
  • dataset:hbase:zk1/ratings

ratings = Datasets.load("dataset:hive:/data/ratings")
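The three URIs above share one shape: the literal `dataset` prefix, a storage scheme, and a location. The sketch below splits a URI along those lines to make the structure explicit. It is an illustration only, not Kite's own URI handling; the class `DatasetUriSketch` and its field names are hypothetical.

```java
/** Illustrative only: splits a dataset URI into its parts. Not Kite's URI handling. */
final class DatasetUriSketch {
    final String storage;   // "hdfs", "hive", or "hbase" in the slide's examples
    final String location;  // everything after the storage scheme

    DatasetUriSketch(String storage, String location) {
        this.storage = storage;
        this.location = location;
    }

    /** Expects "dataset:<storage>:<location>" and rejects anything else. */
    static DatasetUriSketch parse(String uri) {
        String[] parts = uri.split(":", 3);
        if (parts.length != 3 || !"dataset".equals(parts[0])) {
            throw new IllegalArgumentException("Not a dataset URI: " + uri);
        }
        return new DatasetUriSketch(parts[1], parts[2]);
    }
}
```

For example, `DatasetUriSketch.parse("dataset:hbase:zk1/ratings")` yields storage `hbase` and location `zk1/ratings`.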

Dataset configuration, JSON

• Schema (Avro)
  • Record fields, like a table definition

Dataset configuration, JSON

• Schema (Avro)
  • Record fields, like a table definition
• Partition strategy
  • Layout or key definition from record fields

Configuring partitioning

• Partition strategy

[
  { "source" : "timestamp", "type" : "year"},
  { "source" : "timestamp", "type" : "month"},
  { "source" : "timestamp", "type" : "day"}
]

datasets/
└── ratings/
    ├── year=1997/
    │   ├── month=09/
    │   │   ├── day=20/
    │   │   ├── ...
    │   │   └── day=30/
    │   ├── month=10/
    │   │   ├── day=01/
    │   │   ├── ...
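To make the link between the strategy and the directory tree concrete, here is a minimal sketch that derives the `year=/month=/day=` path for one record. It assumes the `timestamp` field holds epoch milliseconds and that dates are computed in UTC; neither is stated on the slide. `PartitionPathSketch` is a hypothetical name, and this is not Kite's partitioner code.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustrative only: builds the partition path shown in the directory tree. */
final class PartitionPathSketch {
    /** Assumes epoch milliseconds interpreted in UTC (an assumption, not from the slide). */
    static String pathFor(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        // Month and day are zero-padded to two digits, as in month=09 and day=01.
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }
}
```

A record stamped at midnight UTC on 20 September 1997 would land under `year=1997/month=09/day=20`.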

Configuring key building

• Partition strategy for HBase

[
  { "source" : "email", "type" : "hash", "buckets": 32},
  { "source" : "email", "type" : "identity"}
]

(22, "buzz@pixar.com")

\x80\x00\x00\x16buzz@pixar.com\x00\x00
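The byte string above can be reproduced with a small sketch. The layout is inferred from the slide's single example and is an assumption: the bucket number is written as a 4-byte big-endian integer with its sign bit flipped, followed by the UTF-8 bytes of the email and two zero bytes as a terminator. The bucket (22) is taken as given, since the slide does not describe the hash function. `HBaseKeySketch` is a hypothetical name, and this is not Kite's encoder.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Illustrative only: reproduces the key bytes shown on the slide. */
final class HBaseKeySketch {
    /** Layout inferred from the slide's example, not taken from Kite's source. */
    static byte[] encode(int bucket, String email) {
        byte[] text = email.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + text.length + 2)
                .putInt(bucket ^ 0x80000000)   // flip the sign bit; ByteBuffer is big-endian by default
                .put(text)
                .put((byte) 0).put((byte) 0)   // two-zero-byte terminator, as in the example
                .array();
    }

    /** Renders printable ASCII as-is and every other byte as \xNN. */
    static String render(byte[] key) {
        StringBuilder sb = new StringBuilder();
        for (byte b : key) {
            int v = b & 0xFF;
            if (v >= 0x20 && v < 0x7F && v != '\\') {
                sb.append((char) v);
            } else {
                sb.append(String.format(Locale.ROOT, "\\x%02x", v));
            }
        }
        return sb.toString();
    }
}
```

Flipping the sign bit is a common way to make signed integers sort correctly as unsigned bytes, which matters because HBase orders rows by raw key bytes.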

Dataset configuration, JSON

• Schema (Avro)
  • Record fields, like a table definition
• Partition strategy
  • Layout or key definition from record fields
• Column mapping (HBase)
  • Where to store record fields

{
  "type" : "record",
  "name" : "User",
  "fields" : [
    { "name" : "email", "type" : "string" },
    ...
  ]
}

Mapping example

family           name               counts   prefs
row key          last        first  visits   flash
buzz@pixar.com   Lightyear   Buzz   315      true

[
  { "source": "email", "type": "key" },
  ...
]

{
  "type" : "record",
  "name" : "User",
  "fields" : [
    { "name" : "lastName", "type" : "string" },
    ...
  ]
}

Mapping example

family           name               counts   prefs
row key          last        first  visits   flash
buzz@pixar.com   Lightyear   Buzz   315      true

[
  { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" },
  ...
]
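A `column` mapping entry says which family and qualifier a record field is stored under. The sketch below applies such entries to a record held as a plain map and produces `family:qualifier` cells. It is a toy model of the idea, not Kite's mapping code; `ColumnMappingSketch` and `Column` are hypothetical names, and any field beyond `email` and `lastName` (for example `firstName`) is made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: applies "column" mapping entries to a record. */
final class ColumnMappingSketch {
    /** One mapping entry of type "column": record field -> family and qualifier. */
    static final class Column {
        final String source;
        final String family;
        final String qualifier;

        Column(String source, String family, String qualifier) {
            this.source = source;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    /** Returns cells keyed by "family:qualifier"; fields missing from the record are skipped. */
    static Map<String, String> toCells(Map<String, String> record, Column... columns) {
        Map<String, String> cells = new LinkedHashMap<>();
        for (Column c : columns) {
            if (record.containsKey(c.source)) {
                cells.put(c.family + ":" + c.qualifier, record.get(c.source));
            }
        }
        return cells;
    }
}
```

With the entry from the slide, a record whose `lastName` is `Lightyear` produces the cell `name:last` with value `Lightyear`, matching the table.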

Command-line demo?

1. Describe your data
   dataset obj-schema org.movielens.Rating --jar app.jar \
     --output rating.avsc

2. Describe your layout
   dataset partition-config ts:year ts:month ts:day \
     --schema rating.avsc --output ymd.json

3. Create a dataset
   dataset create ratings --schema rating.avsc \
     --partition-by ymd.json

Command-line tool

• Executable jar download
  • Inspects the environment

• Must be used on-cluster
  • Classpath for HBase, Hive, etc.

• Debugging: debug=true ./dataset -v <command>

• Requires MAPRED_HOME variable on CDH5

Questions

Ryan Blue: blue@cloudera.com
Kite mailing list: cdk-dev@cloudera.org

Maven parent POM

• Automatic Kite and Hadoop dependencies
• Inherit from kite-app-parent-cdh4
• CDH4 only, CDH5 support in 0.16.0

<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh4</artifactId>
  <version>0.15.0</version>
</parent>

Maven Plugin

• Maven plugin manages datasets for an application
• Configured by app-parent POM
• Handles create, update, etc. in Maven goals

MapReduce

• DatasetKeyInputFormat
• DatasetKeyOutputFormat
• Values are always null

View eventsBeforeToday = Datasets
    .load("dataset:hive:/data/events")
    .toBefore("timestamp", startOfToday());

DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);
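The slide calls `startOfToday()` without defining it. One possible version is sketched below, assuming the `timestamp` field holds epoch milliseconds and that "today" is measured in UTC; the slide states neither. The class name `StartOfTodaySketch` and the `Clock` overload are hypothetical additions so the value can be checked against a fixed time.

```java
import java.time.Clock;
import java.time.LocalDate;
import java.time.ZoneOffset;

/** Illustrative only: one way the slide's startOfToday() helper could be written. */
final class StartOfTodaySketch {
    /** Midnight UTC of the clock's current date, in epoch milliseconds (assumed unit). */
    static long startOfToday(Clock clock) {
        return LocalDate.now(clock)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
    }

    /** Convenience overload using the system clock in UTC. */
    static long startOfToday() {
        return startOfToday(Clock.systemUTC());
    }
}
```

If the dataset stores timestamps in another unit or time zone, the helper would need to change to match.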

Crunch

• CrunchDatasets.asSource
• CrunchDatasets.asTarget

PCollection<Event> events = getPipeline().read(
    CrunchDatasets.asSource(eventsBeforeToday));

• Handle-existing support in 0.16.0
• Configure dependencies with Kite parent POM

DatasetSink

• Write to HDFS Avro and HBase
  • http://tiny.cloudera.com/DatasetSink

• Proxy user support
• Automatic partitioning

agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
agent.sinks.name.kite.dataset.name = events
agent.sinks.name.auth.proxyUser = cloudera