Kite SDK introduction for Portland Big Data

Kite SDK: It’s for developers
Ryan Blue, Software Engineer

Resources

• Kite guide
  • http://tiny.cloudera.com/KiteGuide

• Dataset overview and intro
  • http://tiny.cloudera.com/Datasets

• Command-line tutorial
  • http://tiny.cloudera.com/KiteCLI

• Kite repository and examples
  • https://github.com/kite-sdk/kite
  • https://github.com/kite-sdk/kite-examples

Agenda

• Kite background
• Kite data

What problem does Kite solve?

• Accessibility for getting started
  • Easy to get started, without being an expert
  • Use before understanding

• Save time for experienced developers
  • Off-the-shelf tools for common tasks
  • Quickly iterate and test configurations

Kite Datasets: Motivation

• Focus on using data, not managing files
• Developers shouldn’t have to maintain data files
• Use through configuration, not code
• Need consistency across the platform

[Diagram sequence: (1) an Application (user code) on top of a Database (provided), with data files maintained by the database; (2) Applications (user code) working against a Database and directly against data files and HBase; (3) Applications on top of a Database and Kite Data, with data files and HBase maintained by Kite.]

Kite Datasets: Goals

• Think in terms of data: datasets, views, records
• Describe your data and layout, and Kite does the right thing
• Should work consistently across the platform
• Reliable

Kite Datasets: Compatibility

Project      HDFS (avro)   HDFS (parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format

Current compatibility (0.15.0)

Project      HDFS (avro)   HDFS (parquet)   HBase
Kite         1.0           1.0              1.0
Flume Sink   1.0           1.0              1.0
MapReduce    1.0           1.0              1.0
Crunch       1.0           1.0              1.0
Hive         1.0           1.0              1.1
Impala       1.0           1.0              *

* depends on common HBase encoding format

[Diagram: an Application on top of Kite Data, with data files and HBase maintained by Kite.]

Datasets

• A collection of records or entities
  • Like a Hive or relational table
  • Generic, reflected, or generated objects

• Identified by URI
  • dataset:hdfs:/data/ratings
  • dataset:hive:/data/ratings
  • dataset:hbase:zk1/ratings

ratings = Datasets.load("dataset:hive:/data/ratings")
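The three URIs above share one shape: the literal "dataset" scheme, then a storage scheme, then a storage-specific location. A minimal sketch of that structure using only java.net.URI follows. The class and method names are hypothetical, and this is not how Kite itself resolves URIs; it only illustrates how the parts nest.

```java
import java.net.URI;

// Hypothetical helper, not part of Kite: splits a dataset URI into
// its storage scheme and its storage-specific location.
final class DatasetUriSketch {

  // Returns { storage, location }, e.g. { "hdfs", "/data/ratings" }.
  static String[] split(String datasetUri) {
    URI outer = URI.create(datasetUri);
    if (!"dataset".equals(outer.getScheme())) {
      throw new IllegalArgumentException("Not a dataset URI: " + datasetUri);
    }
    // Everything after "dataset:" is itself a URI such as "hdfs:/data/ratings".
    URI inner = URI.create(outer.getSchemeSpecificPart());
    return new String[] { inner.getScheme(), inner.getSchemeSpecificPart() };
  }

  public static void main(String[] args) {
    for (String uri : new String[] {
        "dataset:hdfs:/data/ratings",
        "dataset:hive:/data/ratings",
        "dataset:hbase:zk1/ratings" }) {
      String[] parts = split(uri);
      System.out.println(parts[0] + " -> " + parts[1]);
    }
  }
}
```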

Dataset configuration, JSON

• Schema (Avro)
  • Record fields, like a table definition

• Partition strategy
  • Layout or key definition from record fields

Configuring partitioning

• Partition strategy

  [
    { "source" : "timestamp", "type" : "year" },
    { "source" : "timestamp", "type" : "month" },
    { "source" : "timestamp", "type" : "day" }
  ]

  datasets/
  └── ratings/
      ├── year=1997/
      │   ├── month=09/
      │   │   ├── day=20/
      │   │   ├── ...
      │   │   └── day=30/
      │   ├── month=10/
      │   │   ├── day=01/
      │   │   ├── ...
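To make the link between the strategy and the directory layout concrete, here is a rough sketch of how a record's timestamp could map to a year/month/day partition path. It assumes the timestamp is epoch milliseconds interpreted in UTC, and the class and method names are hypothetical. It is an illustration of the idea, not Kite's implementation.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical illustration of a year/month/day partition strategy.
final class PartitionPathSketch {

  // Assumption: the source field holds epoch milliseconds, read as UTC.
  static String partitionPath(long timestampMillis) {
    ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
    // One path segment per partitioner, in the order the strategy lists them.
    return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
        t.getYear(), t.getMonthValue(), t.getDayOfMonth());
  }

  public static void main(String[] args) {
    long ts = Instant.parse("1997-09-20T03:05:10Z").toEpochMilli();
    // Prints: datasets/ratings/year=1997/month=09/day=20
    System.out.println("datasets/ratings/" + partitionPath(ts));
  }
}
```

Records whose timestamps fall on the same day land in the same directory, so a query restricted to a date range only needs to read the matching day directories.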

Configuring key building

• Partition strategy for HBase

  [
    { "source" : "email", "type" : "hash", "buckets": 32 },
    { "source" : "email", "type" : "identity" }
  ]

(22, "buzz@pixar.com")

\x80\x00\x00\x16buzz@pixar.com\x00\x00
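The byte string above can be read as the hash bucket 22 (0x16) written as a 4-byte big-endian integer with its sign bit flipped, followed by the email and two zero bytes. The sketch below reproduces those bytes under that reading. The encoding rules are inferred from the slide's example, not taken from Kite's source, and the class and method names are hypothetical. The bucket is passed in rather than computed, because the slide does not say which hash function is used.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the key layout shown on the slide.
final class KeyEncodingSketch {

  // Assumed layout: [int with sign bit flipped][UTF-8 string][0x00 0x00].
  static byte[] encodeKey(int bucket, String email) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    // Flipping the sign bit makes signed ints sort correctly as unsigned bytes.
    byte[] bucketBytes = ByteBuffer.allocate(4).putInt(bucket ^ 0x80000000).array();
    out.write(bucketBytes, 0, bucketBytes.length);
    byte[] emailBytes = email.getBytes(StandardCharsets.UTF_8);
    out.write(emailBytes, 0, emailBytes.length);
    // Two zero bytes terminate the string component.
    out.write(0);
    out.write(0);
    return out.toByteArray();
  }

  // Renders non-printable bytes as \xNN, matching the slide's notation.
  static String printable(byte[] key) {
    StringBuilder sb = new StringBuilder();
    for (byte b : key) {
      int v = b & 0xff;
      if (v >= 0x20 && v < 0x7f && v != '\\') {
        sb.append((char) v);
      } else {
        sb.append(String.format("\\x%02x", v));
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Prints: \x80\x00\x00\x16buzz@pixar.com\x00\x00
    System.out.println(printable(encodeKey(22, "buzz@pixar.com")));
  }
}
```

Putting the bucket first spreads keys across 32 ranges, while the identity component keeps each email addressable within its bucket.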

• Column mapping (HBase)
  • Where to store record fields

Mapping example

  {
    "type" : "record",
    "name" : "User",
    "fields" : [
      { "name" : "email", "type" : "string" },
      ...
    ]
  }

                    family: name         family: counts   family: prefs
  row key           last        first    visits           flash
  buzz@pixar.com    Lightyear   Buzz     315              true

  [
    { "source": "email", "type": "key" },
    ...
  ]

Mapping example

  {
    "type" : "record",
    "name" : "User",
    "fields" : [
      { "name" : "lastName", "type" : "string" },
      ...
    ]
  }

                    family: name         family: counts   family: prefs
  row key           last        first    visits           flash
  buzz@pixar.com    Lightyear   Buzz     315              true

  [
    { "source": "lastName", "type": "column",
      "family": "name", "qualifier": "last" },
    ...
  ]
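As a rough illustration of what a column mapping expresses, the sketch below applies the two mapping entries shown on the slides to a record held in a plain map: a "key" entry supplies the row key, and a "column" entry stores its field under family:qualifier. All class and method names here are hypothetical and none of this is Kite API. The remaining entries are elided on the slides, so they are left out here too.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of applying a column mapping to one record.
final class ColumnMappingSketch {

  // One mapping entry; family and qualifier are null for "key" entries.
  static final class FieldMapping {
    final String source, type, family, qualifier;
    FieldMapping(String source, String type, String family, String qualifier) {
      this.source = source;
      this.type = type;
      this.family = family;
      this.qualifier = qualifier;
    }
  }

  // The row key is the value of the first field mapped with type "key".
  static Object rowKey(Map<String, Object> record, List<FieldMapping> mappings) {
    for (FieldMapping m : mappings) {
      if ("key".equals(m.type)) {
        return record.get(m.source);
      }
    }
    throw new IllegalArgumentException("No key mapping");
  }

  // Each "column" entry becomes one cell addressed as family:qualifier.
  static Map<String, Object> toCells(Map<String, Object> record, List<FieldMapping> mappings) {
    Map<String, Object> cells = new LinkedHashMap<>();
    for (FieldMapping m : mappings) {
      if ("column".equals(m.type)) {
        cells.put(m.family + ":" + m.qualifier, record.get(m.source));
      }
    }
    return cells;
  }

  public static void main(String[] args) {
    Map<String, Object> user = new HashMap<>();
    user.put("email", "buzz@pixar.com");
    user.put("lastName", "Lightyear");
    List<FieldMapping> mappings = Arrays.asList(
        new FieldMapping("email", "key", null, null),
        new FieldMapping("lastName", "column", "name", "last"));
    // Prints: buzz@pixar.com {name:last=Lightyear}
    System.out.println(rowKey(user, mappings) + " " + toCells(user, mappings));
  }
}
```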

Command-line demo?

1. Describe your data

   dataset obj-schema org.movielens.Rating --jar app.jar \
     --output rating.avsc

2. Describe your layout

   dataset partition-config ts:year ts:month ts:day \
     --schema rating.avsc --output ymd.json

3. Create a dataset

   dataset create ratings --schema rating.avsc \
     --partition-by ymd.json

Command-line tool

• Executable jar download
  • Inspects the environment

• Must be used on-cluster
  • Classpath for HBase, Hive, etc.

• Debugging: debug=true ./dataset -v <command>

• Requires MAPRED_HOME variable on CDH5

Questions

Ryan Blue: blue@cloudera.com
Kite mailing list: cdk-dev@cloudera.org

Maven parent POM

• Automatic Kite and Hadoop dependencies
• Inherit from kite-app-parent-cdh4
• CDH4 only, CDH5 support in 0.16.0

  <parent>
    <groupId>org.kitesdk</groupId>
    <artifactId>kite-app-parent-cdh4</artifactId>
    <version>0.15.0</version>
  </parent>

Maven Plugin

• Maven plugin manages datasets for an application
• Configured by app-parent POM
• Handles create, update, etc. as Maven goals

MapReduce

• DatasetKeyInputFormat
• DatasetKeyOutputFormat
• Values are always null

View eventsBeforeToday = Datasets
    .load("dataset:hive:/data/events")
    .toBefore("timestamp", startOfToday());

DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);

Crunch

• CrunchDatasets.asSource• CrunchDatasets.asTarget

PCollection<Event> events = getPipeline().read(
    CrunchDatasets.asSource(eventsBeforeToday));

• Handle-existing support in 0.16.0
• Configure dependencies with Kite parent POM

DatasetSink

• Write to HDFS Avro and HBase
  • http://tiny.cloudera.com/DatasetSink

• Proxy user support
• Automatic partitioning

agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
agent.sinks.name.kite.dataset.name = events
agent.sinks.name.auth.proxyUser = cloudera
