Cloudera Developer Kit (CDK)
Transcript of Cloudera Developer Kit (CDK)
Cloudera Developer Kit: Hadoop Application Development Made Easier
E. Sammer | Engineering Manager
Big Data Gurus - 2013/09/16 - @esammer
“[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”
http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/
Hadoop is incredibly powerful
Hadoop is incredibly flexible
Hadoop is incredibly low-level
Hadoop is incredibly complex
A typical system (zoom 100:1)
A typical system (zoom 10:1)
A typical system (zoom 5:1)
What you actually care about
Getting data from A to B
Using it later
Infrastructure details
Serialization, file formats, and compression
Metadata capture and maintenance
Dataset organization and partitioning
Durability and delivery guarantees
Well-defined failure semantics
Performance and health instrumentation
Cloudera Development Kit
Make Hadoop accessible to the enterprise developer
Codify expert patterns and practices
Make the “right thing” easy and obvious
Address the most common cases
Let developers focus on business logic, not infrastructure
Cloudera Development Kit
An open source set of libraries, guides, and examples for building data-oriented systems and applications
Provides higher level APIs atop existing components of CDH
Supports piecemeal adoption via loosely coupled modules
CDK Data Module
High level APIs for interacting with datasets in HDFS
Configuration-based format and schema management
Consistent data model and serialization semantics
Metadata system integration and support
Automatic dataset partitioning and file management
Code

// Open a repository of datasets rooted at /data on the configured file system.
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .fileSystem(FileSystem.get(new Configuration()))
    .directory(new Path("/data"))
    .get();

// Create the "events" dataset, hash-partitioned on userId into 53 buckets.
Dataset events = repo.create("events",
    new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder().hash("userId", 53).get())
        .get());

// Write a single record. The slide does not define schema; it is assumed
// here to be the Avro Schema loaded from event.avsc.
DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
    new GenericRecordBuilder(schema)
        .set("userId", 1)
        .set("timeStamp", System.currentTimeMillis())
        .build());
writer.close();

Data

/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /userId=0
      /10000000.avro
      /10000001.avro
    /userId=1
      /20000000.avro
    /userId=2
      /30000000.avro
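
The userId=N directories in the layout above come from the hash partition strategy. The following is a minimal sketch of how a field value could be mapped to one of 53 buckets, using only the Java standard library. The class and method names are hypothetical, and the masking formula is an assumption made for illustration, not a statement of how CDK computes its partitions.

```java
import java.util.Objects;

// Illustrative sketch only; names and formula are assumptions, not the CDK implementation.
final class HashPartitionSketch {

    // Map a field value to a bucket in [0, buckets).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int bucket(Object value, int buckets) {
        return (Objects.hashCode(value) & Integer.MAX_VALUE) % buckets;
    }

    // Build a partition directory name in the "field=bucket" style shown on the slide.
    static String partitionDir(String field, Object value, int buckets) {
        return field + "=" + bucket(value, buckets);
    }

    public static void main(String[] args) {
        // An Integer's hash code is its own value, so userId 1 lands in bucket 1 of 53.
        System.out.println(partitionDir("userId", 1, 53));
        // 54 and 1 collide because 54 % 53 == 1; many users share each directory.
        System.out.println(partitionDir("userId", 54, 53));
    }
}
```

Under this scheme the number of directories is bounded by the bucket count rather than by the number of distinct users, which keeps the file count manageable.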
CDK Morphlines Module
Pluggable, configuration-driven data transform library
Born out of Cloudera Search, but general purpose
Configure record transform stages in a container library
Use the library in Flume, MapReduce jobs, Storm, and other Java applications
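
To make the idea of configuration-driven transform stages concrete, here is a rough sketch in plain Java. It is not the Morphlines API: the class, the Stage interface, the command names (trim, lowercase, dropIfMissing), and the list-of-pairs "configuration" are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Illustrative sketch only; this is not the Morphlines API.
final class TransformChainSketch {

    // A record is modeled as a field-name -> value map.
    // A stage returns the (possibly modified) record, or null to drop it.
    interface Stage extends UnaryOperator<Map<String, Object>> {}

    // Turn one hypothetical config entry (command name + field) into a stage.
    static Stage stage(String command, String field) {
        switch (command) {
            case "trim":
                return r -> {
                    Object v = r.get(field);
                    if (v instanceof String) r.put(field, ((String) v).trim());
                    return r;
                };
            case "lowercase":
                return r -> {
                    Object v = r.get(field);
                    if (v instanceof String) r.put(field, ((String) v).toLowerCase(Locale.ROOT));
                    return r;
                };
            case "dropIfMissing":
                return r -> r.containsKey(field) ? r : null;
            default:
                throw new IllegalArgumentException("Unknown command: " + command);
        }
    }

    // Build the whole chain from configuration entries of the form {command, field}.
    static List<Stage> compile(List<String[]> config) {
        List<Stage> chain = new ArrayList<>();
        for (String[] entry : config) {
            chain.add(stage(entry[0], entry[1]));
        }
        return chain;
    }

    // Run a copy of the record through each stage in order; stop early if a stage drops it.
    static Map<String, Object> run(List<Stage> chain, Map<String, Object> record) {
        Map<String, Object> current = new LinkedHashMap<>(record);
        for (Stage s : chain) {
            current = s.apply(current);
            if (current == null) return null;
        }
        return current;
    }
}
```

Because the chain is built from data rather than hard-coded, the same stages could in principle be reused by any host program that can hand records to run.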
Other Modules
Maven plugin
  Package, deploy, and execute “apps”
  Execute dataset operations
Examples
  POJO, generic, and generated entity ingest
  Dataset administrative operations
  Crunch and MR integration
  ...
Future
HBase
  Extending data APIs to support random access
  Same automatic serialization, schema management, etc.
Higher-order data management
  Common tasks
  Think background compaction, conversion, etc.
Integration with existing middleware frameworks
Give us all your good ideas (and code)!
Getting started
CDK code repo: github.com/cloudera/cdk
CDK example repo: github.com/cloudera/cdk-examples
Binary artifacts available from Cloudera’s Maven repository
Community forums: community.cloudera.com
Mailing list: groups.google.com/a/cloudera.org/d/forum/cdk-dev
JIRA: issues.cloudera.org/browse/CDK
Questions?
I also wrote a book.
We’re going to give a few copies away.