Cloudera Developer Kit (CDK)

Cloudera Developer Kit: Hadoop Application Development Made Easier
E. Sammer | Engineering Manager
Big Data Gurus - 2013/09/16 - @esammer

description

Eric Sammer from Cloudera talks about Cloudera Developer Kit

Transcript of Cloudera Developer Kit (CDK)

Page 1: Cloudera Developer Kit (CDK)

Cloudera Developer Kit: Hadoop Application Development Made Easier

E. Sammer | Engineering Manager
Big Data Gurus - 2013/09/16 - @esammer

Page 2: Cloudera Developer Kit (CDK)

“[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”

http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/

Page 3: Cloudera Developer Kit (CDK)

Hadoop is incredibly powerful

Page 4: Cloudera Developer Kit (CDK)

Hadoop is incredibly flexible

Page 5: Cloudera Developer Kit (CDK)

Hadoop is incredibly low-level

Page 6: Cloudera Developer Kit (CDK)

Hadoop is incredibly complex

Page 7: Cloudera Developer Kit (CDK)

A typical system (zoom 100:1)

Page 8: Cloudera Developer Kit (CDK)

A typical system (zoom 10:1)

Page 9: Cloudera Developer Kit (CDK)

A typical system (zoom 5:1)

Page 10: Cloudera Developer Kit (CDK)

What you actually care about

Getting data from A to B
Using it later

Page 11: Cloudera Developer Kit (CDK)

Infrastructure details

Serialization, file formats, and compression
Metadata capture and maintenance
Dataset organization and partitioning
Durability and delivery guarantees
Well-defined failure semantics
Performance and health instrumentation

Page 12: Cloudera Developer Kit (CDK)

Cloudera Development Kit

Make Hadoop accessible to the enterprise developer
Codify expert patterns and practices
Make the “right thing” easy and obvious
Address the most common cases

Let developers focus on business logic, not infrastructure

Page 13: Cloudera Developer Kit (CDK)

Cloudera Development Kit

An open source set of libraries, guides, and examples for building data-oriented systems and applications
Provides higher level APIs atop existing components of CDH
Supports piecemeal adoption via loosely coupled modules

Page 14: Cloudera Developer Kit (CDK)

CDK Data Module

High level APIs for interacting with datasets in HDFS
Configuration-based format and schema management
Consistent data model and serialization semantics
Metadata system integration and support
Automatic dataset partitioning and file management

Page 15: Cloudera Developer Kit (CDK)

Code

DatasetRepository repo = new FileSystemDatasetRepository.Builder()
    .fileSystem(FileSystem.get(new Configuration()))
    .directory(new Path("/data"))
    .get();

Dataset events = repo.create("events",
    new DatasetDescriptor.Builder()
        .schema(new File("event.avsc"))
        .partitionStrategy(
            new PartitionStrategy.Builder().hash("userId", 53).get())
        .get());

// The slide uses `schema` without defining it; reading it back from the
// dataset descriptor is an assumption.
Schema schema = events.getDescriptor().getSchema();

DatasetWriter<GenericRecord> writer = events.getWriter();
writer.open();
writer.write(
    new GenericRecordBuilder(schema)
        .set("userId", 1)
        .set("timeStamp", System.currentTimeMillis())
        .build());
writer.close();

Data

/data
  /events
    /.metadata
      /schema.avsc
      /descriptor.properties
    /userId=0
      /10000000.avro
      /10000001.avro
    /userId=1
      /20000000.avro
    /userId=2
      /30000000.avro
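
The hash("userId", 53) partition strategy and the userId=N directories above suggest that each record is routed to one of 53 buckets by a hash of its userId. The following is a minimal sketch of that idea in plain Java, not CDK's implementation: the class HashPartitionSketch, its method names, and the use of Object.hashCode() are all assumptions made for illustration, and the hash function CDK actually uses may differ.

```java
/** Illustrative only: a stand-in for hash partitioning, not CDK code. */
final class HashPartitionSketch {

    /**
     * Maps a key to a bucket in the range [0, buckets).
     * Math.floorMod keeps negative hash codes inside the range.
     */
    static int bucketFor(Object key, int buckets) {
        return Math.floorMod(key.hashCode(), buckets);
    }

    /** Builds a directory path in the "field=bucket" style shown on the slide. */
    static String partitionPath(String datasetRoot, String field, Object key, int buckets) {
        return datasetRoot + "/" + field + "=" + bucketFor(key, buckets);
    }

    public static void main(String[] args) {
        // Hypothetical user ids; several can land in the same bucket.
        for (int userId : new int[] {1, 54, 107, -3}) {
            System.out.println(userId + " -> "
                + partitionPath("/data/events", "userId", userId, 53));
        }
    }
}
```

Because the bucket count is fixed when the dataset is created, the number of partition directories stays bounded no matter how many distinct user ids arrive.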

Page 16: Cloudera Developer Kit (CDK)

CDK Morphlines Module

Pluggable, configuration-driven data transform library
Born out of Cloudera Search, but general purpose
Configure record transform stages in a container library
Use the library in Flume, MapReduce jobs, Storm, and other Java applications
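
The idea of configuring record transform stages, rather than hard-coding them, can be sketched in plain Java. This is not the Morphlines API: the class TransformChainSketch, the stage names trimMessage and addLength, and the map-based record are hypothetical, chosen only to show a pipeline that is assembled from a list of stage names such as a configuration file might supply.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Illustrative only: a tiny configuration-driven transform chain, not Morphlines. */
final class TransformChainSketch {

    /** Looks up a stage by name; a real library would discover stages as plugins. */
    static UnaryOperator<Map<String, Object>> stage(String name) {
        switch (name) {
            case "trimMessage":
                return r -> {
                    r.computeIfPresent("message", (k, v) -> v.toString().trim());
                    return r;
                };
            case "addLength":
                return r -> {
                    Object m = r.get("message");
                    if (m != null) {
                        r.put("length", m.toString().length());
                    }
                    return r;
                };
            default:
                throw new IllegalArgumentException("Unknown stage: " + name);
        }
    }

    /** Applies the named stages in order to a copy of the input record. */
    static Map<String, Object> run(List<String> stageNames, Map<String, Object> record) {
        Map<String, Object> current = new LinkedHashMap<>(record);
        for (String name : stageNames) {
            current = stage(name).apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("message", "  hello  ");
        // The stage list stands in for what would normally come from configuration.
        System.out.println(run(Arrays.asList("trimMessage", "addLength"), record));
    }
}
```

Since the pipeline is only a list of names, the same stages can be reordered or reused by different host applications without recompiling the transform code.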

Page 17: Cloudera Developer Kit (CDK)

Other Modules

Maven plugin
  Package, deploy, and execute “apps”
  Execute dataset operations

Examples
  POJO, generic, and generated entity ingest
  Dataset administrative operations
  Crunch and MR integration
  ...

Page 18: Cloudera Developer Kit (CDK)

Future

HBase
  Extending data APIs to support random access
  Same automatic serialization, schema management, etc.

Higher-order data management
  Common tasks
  Think background compaction, conversion, etc.

Integration with existing middleware frameworks
Give us all your good ideas (and code)!

Page 19: Cloudera Developer Kit (CDK)

Getting started

CDK code repo: github.com/cloudera/cdk
CDK example repo: github.com/cloudera/cdk-examples
Binary artifacts available from Cloudera’s Maven repository
Community forums: community.cloudera.com
Mailing list: groups.google.com/a/cloudera.org/d/forum/cdk-dev
JIRA: issues.cloudera.org/browse/CDK

Page 20: Cloudera Developer Kit (CDK)

Questions?

I also wrote a book.
We’re going to give a few copies away.

Page 21: Cloudera Developer Kit (CDK)