Cloudera Developer Kit (CDK)


Cloudera Developer Kit: Hadoop Application Development Made Easier

E. Sammer | Engineering Manager
Big Data Gurus - 2013/09/16 - @esammer


“[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”

http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/


Hadoop is incredibly powerful


Hadoop is incredibly flexible


Hadoop is incredibly low-level


Hadoop is incredibly complex


A typical system (zoom 100:1)


What you actually care about

- Getting data from A to B
- Using it later


Infrastructure details

- Serialization, file formats, and compression
- Metadata capture and maintenance
- Dataset organization and partitioning
- Durability and delivery guarantees
- Well-defined failure semantics
- Performance and health instrumentation


Cloudera Development Kit

- Make Hadoop accessible to the enterprise developer
- Codify expert patterns and practices
- Make the “right thing” easy and obvious
- Address the most common cases

Let developers focus on business logic, not infrastructure


Cloudera Development Kit

- An open source set of libraries, guides, and examples for building data-oriented systems and applications
- Provides higher level APIs atop existing components of CDH
- Supports piecemeal adoption via loosely coupled modules


CDK Data Module

- High level APIs for interacting with datasets in HDFS
- Configuration-based format and schema management
- Consistent data model and serialization semantics
- Metadata system integration and support
- Automatic dataset partitioning and file management


Code

// Open a dataset repository rooted at /data on the default file system.
DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .fileSystem(FileSystem.get(new Configuration()))
    .directory(new Path("/data"))
    .get();

// Create the "events" dataset: Avro schema from event.avsc,
// hash-partitioned on userId into 53 buckets.
Dataset events = repo.create("events",
    new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder().hash("userId", 53).get())
        .get());

// Write one record. `schema` is not defined on the slide; it is assumed
// to be the Avro Schema parsed from event.avsc.
DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
    new GenericRecordBuilder(schema)
        .set("userId", 1)
        .set("timeStamp", System.currentTimeMillis())
        .build());
writer.close();

Data

/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /userId=0
      /10000000.avro
      /10000001.avro
    /userId=1
      /20000000.avro
    /userId=2
      /30000000.avro
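The hash("userId", 53) partition strategy on this slide spreads records across a fixed number of buckets, which is what produces the userId=0, userId=1, ... directories. The following is a minimal, self-contained sketch of that idea in plain Java. It does not use the CDK API, and the class name, method names, and hash function (Objects.hashCode with a floor modulus) are assumptions for illustration only; CDK's actual hashing may differ.

```java
import java.util.Objects;

// Hypothetical sketch of hash partitioning; not part of the CDK library.
final class HashPartitionSketch {

    // Maps a field value to a bucket in the range [0, buckets).
    // Assumption: uses Java's hashCode; the real library may hash differently.
    static int bucketFor(Object fieldValue, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(Objects.hashCode(fieldValue), buckets);
    }

    // Builds a directory name in the "field=bucket" style shown on the slide.
    static String partitionDirectory(String fieldName, Object fieldValue, int buckets) {
        return fieldName + "=" + bucketFor(fieldValue, buckets);
    }

    public static void main(String[] args) {
        // Print the partition directory each sample userId would land in.
        for (int userId : new int[] {1, 54, 107, -3}) {
            System.out.println(userId + " -> " + partitionDirectory("userId", userId, 53));
        }
    }
}
```

Because the bucket count is fixed, the number of partition directories stays bounded no matter how many distinct user IDs arrive, and any writer can compute the target directory from the record alone.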


CDK Morphlines Module

- Pluggable, configuration-driven data transform library
- Born out of Cloudera Search, but general purpose
- Configure record transform stages in a container library
- Use the library in Flume, MapReduce jobs, Storm, and other Java applications
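To make the idea of chained record transform stages concrete, here is a rough sketch in plain Java. It is not the Morphlines API: the Stage interface, the two stages, and the "message" field are all hypothetical. Real Morphlines pipelines are declared in a configuration file rather than wired up in code as done here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a staged record-transform pipeline; not Morphlines itself.
final class TransformPipelineSketch {

    // One transform stage: returns the (possibly modified) record, or empty to drop it.
    interface Stage {
        Optional<Map<String, Object>> apply(Map<String, Object> record);
    }

    // Trims whitespace from the "message" field when it is a string.
    static final Stage TRIM_MESSAGE = record -> {
        Map<String, Object> out = new HashMap<>(record);
        Object message = out.get("message");
        if (message instanceof String) {
            out.put("message", ((String) message).trim());
        }
        return Optional.of(out);
    };

    // Drops records whose "message" field is missing, non-string, or empty.
    static final Stage DROP_EMPTY = record -> {
        Object message = record.get("message");
        if (message instanceof String && !((String) message).isEmpty()) {
            return Optional.of(record);
        }
        return Optional.empty();
    };

    // The stage order matters: trimming runs before the emptiness check.
    static final List<Stage> DEFAULT_STAGES = Arrays.asList(TRIM_MESSAGE, DROP_EMPTY);

    // Runs the record through each stage in order, stopping once a stage drops it.
    static Optional<Map<String, Object>> run(List<Stage> stages, Map<String, Object> record) {
        Optional<Map<String, Object>> current = Optional.of(record);
        for (Stage stage : stages) {
            if (!current.isPresent()) {
                break;
            }
            current = stage.apply(current.get());
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("message", "  user logged in  ");
        System.out.println(run(DEFAULT_STAGES, event));
    }
}
```

Keeping each stage small and independent is what lets the same pipeline definition be reused from different hosts such as Flume or a MapReduce job.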


Other Modules

- Maven plugin
  - Package, deploy, and execute “apps”
  - Execute dataset operations

- Examples
  - POJO, generic, and generated entity ingest
  - Dataset administrative operations
  - Crunch and MR integration
  - ...


Future

- HBase
  - Extending data APIs to support random access
  - Same automatic serialization, schema management, etc.

- Higher-order data management
  - Common tasks
  - Think background compaction, conversion, etc.

- Integration with existing middleware frameworks
- Give us all your good ideas (and code)!


Getting started

- CDK code repo: github.com/cloudera/cdk
- CDK example repo: github.com/cloudera/cdk-examples
- Binary artifacts available from Cloudera’s Maven repository
- Community forums: community.cloudera.com
- Mailing list: groups.google.com/a/cloudera.org/d/forum/cdk-dev
- JIRA: issues.cloudera.org/browse/CDK


Questions?

I also wrote a book.
We’re going to give a few copies away.
