Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using Elastic Data Processing...

Post on 11-Aug-2014


Big Data Computations Using Elastic Data Processing in OpenStack Cloud

Sergey Lukjanov (Mirantis)

Alexander Ignatov (Mirantis)

Trevor McKay (Red Hat)

Agenda

• OpenStack Data Processing Overview

• EDP Architecture & Technical Concepts

• Live Demo

OpenStack Data Processing: Sahara

Mission: To provide a scalable data processing stack and associated management interfaces.

• provision and operate Hadoop clusters

• schedule and operate Hadoop jobs

Hadoop - Big Data Platform

© http://hortonworks.com/hadoop/yarn/

Trends

http://www.google.com/trends/

Architecture overview

[Diagram. Components shown: Horizon, Savanna Pages, Savanna Python Client, REST API, Auth, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources, Vendors Plugins, Data Access Layer, DB, and Hadoop VMs, together with the OpenStack services Keystone, Swift, Heat, Nova, Glance, Cinder, Neutron, and Trove.]

Sahara status

• Official integrated OpenStack project

• Supported Hadoop distros:

    • Vanilla Apache Hadoop

    • Hortonworks Data Platform

    • Intel Distribution

    • Cloudera Distribution in blueprint

• Included in OpenStack distros:

    • RDO - openstack.redhat.com

    • Mirantis OpenStack - software.mirantis.com

Contributors

Elastic Data Processing

• EDP - API for executing MapReduce jobs on Hadoop clusters (similar to AWS EMR)

• Supported data sources: Swift, HDFS, Ceph

• Supported job types: Java actions, MapReduce, MapReduce.Streaming, Pig, Hive

• Oozie for Hadoop jobs workflow management

• Supports both Hadoop 1 & 2

• Job executions on transient clusters
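The "transient cluster" idea in the last bullet can be sketched in a few lines of Python: a cluster exists only for the duration of one job and is removed afterwards, even if the job fails. Everything named here (FakeProvisioner, transient_cluster, run_job) is hypothetical and exists only for illustration; none of it is a Sahara API.

```python
from contextlib import contextmanager


class FakeProvisioner:
    """Hypothetical stand-in for a provisioning backend; it only records calls."""

    def __init__(self):
        self.events = []

    def create_cluster(self, name):
        self.events.append(("create", name))
        return name

    def delete_cluster(self, name):
        self.events.append(("delete", name))


@contextmanager
def transient_cluster(provisioner, name):
    """Create a cluster for one job and delete it afterwards, even on failure."""
    cluster = provisioner.create_cluster(name)
    try:
        yield cluster
    finally:
        provisioner.delete_cluster(cluster)


def run_job(provisioner, cluster, job_name):
    # Placeholder for job submission: it only records that the job ran.
    provisioner.events.append(("run", job_name, cluster))


if __name__ == "__main__":
    provisioner = FakeProvisioner()
    with transient_cluster(provisioner, "tmp-cluster") as cluster:
        run_job(provisioner, cluster, "wordcount")
    print(provisioner.events)
```

The context manager keeps creation and teardown in one place, so the cluster is released whether the job succeeds or raises.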

EDP Use Cases

• Simplified task executions. You don’t need to know Hadoop!

• Bursty workloads: ad-hoc queries that need significant resources only for a short period of time

• Utilization of free IaaS capacity for Hadoop tasks

EDP - Data Sources

[Diagram: Sahara EDP, Hadoop VMs, and Swift holding INPUT and OUTPUT.]

swift://some_container/INPUT

swift://some_container/OUTPUT
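As a small illustration of how the data source URLs above break down, the following sketch splits a swift:// URL into a container and an object path using only the Python standard library. The function name parse_swift_url is hypothetical, and the sketch assumes the simple swift://container/path form shown on the slide.

```python
from urllib.parse import urlparse


def parse_swift_url(url):
    """Split a swift://container/path URL into (container, object_path)."""
    parts = urlparse(url)
    if parts.scheme != "swift":
        raise ValueError("not a swift:// URL: %r" % url)
    # The part right after the scheme is the container; the rest is the object path.
    return parts.netloc, parts.path.lstrip("/")


if __name__ == "__main__":
    for url in ("swift://some_container/INPUT", "swift://some_container/OUTPUT"):
        print(parse_swift_url(url))
```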

EDP - Job Binaries

[Diagram: Sahara EDP, with job binaries held in Swift and in the Sahara DB.]

internal-db://script.pig

swift://some_container/mapreduce.jar

1. Pig, Hive scripts

2. Executable Jar files

3. Pluggable binaries and libraries
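The two URL forms above tell EDP where a job binary lives. A minimal sketch of that lookup, assuming only the two schemes shown on the slide and guessing the kind of binary from its file extension, might look like this. The names STORES, KINDS, and describe_binary are hypothetical, and the extension mapping is an illustration rather than Sahara's behavior.

```python
from urllib.parse import urlparse

# Storage back ends named on the slide, keyed by URL scheme.
STORES = {"internal-db": "Sahara DB", "swift": "Swift"}

# Illustrative mapping only: guess the kind of binary from its extension.
KINDS = {".pig": "Pig script", ".jar": "Jar file"}


def describe_binary(url):
    """Return (store, file_name, kind) for a job binary URL."""
    parts = urlparse(url)
    store = STORES.get(parts.scheme)
    if store is None:
        raise ValueError("unsupported job binary URL: %r" % url)
    # internal-db://script.pig has no path, so the name sits in the netloc.
    name = parts.path.rsplit("/", 1)[-1] or parts.netloc
    kind = next((k for ext, k in KINDS.items() if name.endswith(ext)),
                "library or other binary")
    return store, name, kind


if __name__ == "__main__":
    print(describe_binary("internal-db://script.pig"))
    print(describe_binary("swift://some_container/mapreduce.jar"))
```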

EDP - Job Execution. Step 1

[Diagram: Sahara (EDP; DB: Jar, Pig) and Swift (INPUT; Jar, Pig).]

EDP - Job Execution. Step 2

[Diagram: Sahara (EDP; DB: Jar, Pig), Swift (INPUT; Jar, Pig), and three Hadoop VMs with JobTracker and Oozie.]

EDP - Job Execution. Step 3

[Diagram: Sahara (EDP; DB: Jar, Pig), Swift (INPUT; Jar, Pig), and three Hadoop VMs with JobTracker and Oozie. Label at Oozie: "Execute a job".]

EDP - Job Execution. Step 4

[Diagram: Sahara (EDP; DB: Jar, Pig), Swift (INPUT; Jar, Pig), and three Hadoop VMs with JobTracker and Oozie.]

EDP - Job Execution. Step 5

[Diagram: Sahara (EDP; DB: Jar, Pig), Swift (INPUT; Jar, Pig), and three Hadoop VMs with JobTracker and Oozie.]

workflow.xml:

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials

EDP - Job Execution. Step 6

[Diagram: Sahara (EDP; DB: Jar, Pig), Swift (INPUT; Jar, Pig), and three Hadoop VMs with JobTracker and Oozie. Labels on the Hadoop side: Data Processing, OUTPUT.]

workflow.xml:

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials

EDP - Job Execution. Step 7

[Diagram: Sahara (EDP; DB: Jar, Pig), Swift (INPUT; Jar, Pig), and three Hadoop VMs with JobTracker and Oozie. Labels on the Hadoop side: Data Processing, OUTPUT.]

workflow.xml:

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials
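To make the four kinds of workflow.xml content more concrete, here is a rough Python sketch that assembles a small workflow document for a Pig job. It is loosely modeled on the Oozie workflow format and deliberately simplified (no XML namespace, no job-tracker or name-node elements), so it is not the file Sahara actually generates. The function build_workflow and the example.swift.* property names used in the usage part are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_workflow(job_name, script_url, input_url, output_url, configs, credentials):
    """Build a simplified, Oozie-style workflow document for one Pig action."""
    app = ET.Element("workflow-app", {"name": job_name})
    ET.SubElement(app, "start", {"to": "job"})
    action = ET.SubElement(app, "action", {"name": "job"})
    pig = ET.SubElement(action, "pig")

    # 1. Job-specific configurations and 4. credentials, both as properties.
    configuration = ET.SubElement(pig, "configuration")
    for name, value in list(configs.items()) + list(credentials.items()):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value

    # 2. URL of the job binary (here a Pig script).
    ET.SubElement(pig, "script").text = script_url

    # 3. URLs for the data sources, passed to the script as parameters.
    ET.SubElement(pig, "param").text = "INPUT=" + input_url
    ET.SubElement(pig, "param").text = "OUTPUT=" + output_url

    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Job failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


if __name__ == "__main__":
    print(build_workflow(
        "clean",
        "internal-db://script.pig",
        "swift://some_container/INPUT",
        "swift://some_container/OUTPUT",
        {"mapred.reduce.tasks": "2"},
        # Hypothetical property names, for illustration only.
        {"example.swift.username": "demo", "example.swift.password": "secret"},
    ))
```

Keeping credentials in a separate argument makes it explicit that they end up in the same document as the job configuration, which matches the list on the slide.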

EDP BigPetStore Demo

BigPetStore is now part of Apache BigTop

• Test/demo laboratory for all things Hadoop

• Actively developed with integration testing

• Generates and processes data of arbitrary size

• git clone git://git.apache.org/bigtop.git

• Filed under bigtop/bigtop-bigpetstore

EDP BigPetStore Demo

What are we going to do?

• Generate 1M records of pet supply purchases

• Clean the data (“dirty CSV”)

• Extract cumulative counts by state

• Demonstrates Sahara EDP objects

    • Job Binaries

    • Jobs (Java and Pig)

    • Data Sources

EDP BigPetStore Sample Data

Generated Data (first job)

$ hadoop fs -cat bigpetstore/gen/part-r-00000 | more

BigPetStore,storeCode_AK,1 deanna,booker,Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food

BigPetStore,storeCode_AK,10 erica,buck,Thu Dec 25 16:29:28 GMT+00:00 1969,10.5,dog-food

Cleaned Data (second job)

$ hadoop fs -cat bigpetstore/clean/part-m-00000 | more

BigPetStore storeCode_AK 1 deanna booker Sun Jan 18 20:50:06 GMT+00:00 1970 7.5 cat-food

BigPetStore storeCode_AK 10 erica buck Thu Dec 25 16:29:28 GMT+00:00 1969 10.5 dog-food
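The difference between the generated and the cleaned records can be shown with a short Python sketch. It stands in for the demo's cleaning job and is not the script used in the demo; it assumes the cleaned fields are tab-separated (the transcript only shows whitespace) and that the third comma-separated field holds the transaction number followed by the first name. The function name clean_record is hypothetical.

```python
def clean_record(line):
    """Turn one generated "dirty CSV" record into a tab-separated record."""
    store, code, id_and_first, last, date, price, product = line.strip().split(",")
    # The third field looks like "1 deanna": transaction number, then first name.
    tx_id, first = id_and_first.split(None, 1)
    return "\t".join([store, code, tx_id, first, last, date, price, product])


if __name__ == "__main__":
    dirty = ("BigPetStore,storeCode_AK,1 deanna,booker,"
             "Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food")
    print(clean_record(dirty))
```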

EDP BigPetStore Sample Data

Summed Data For Products by State (third job)

$ hadoop fs -cat bigpetstore/analyze_rel/part-r-00000 | more

US-AK cat-food 24837

US-AK dog-food 24994

US-AK fuzzy-collar 25145

US-AK antelope-caller 25024

US-AZ cat-food 25106

US-AZ dog-food 25064

US-AZ leather-collar 24870

US-AZ snake-bite ointment 24960
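The per-state totals above amount to a group-and-count over the cleaned records. A minimal Python sketch of that aggregation follows; it is not the demo's job. It assumes tab-separated cleaned records with the store code in the second field and the product in the last field, and it assumes the "US-" prefix is derived from the store code as the sample output suggests. The function name count_by_state is hypothetical.

```python
from collections import Counter


def count_by_state(cleaned_lines):
    """Count purchases per (state, product) from tab-separated cleaned records."""
    counts = Counter()
    for line in cleaned_lines:
        fields = line.rstrip("\n").split("\t")
        store_code, product = fields[1], fields[-1]
        # Assumption: "storeCode_AK" maps to the state label "US-AK".
        state = "US-" + store_code.split("_", 1)[1]
        counts[(state, product)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "BigPetStore\tstoreCode_AK\t1\tdeanna\tbooker\tSun Jan 18 20:50:06 GMT+00:00 1970\t7.5\tcat-food",
        "BigPetStore\tstoreCode_AK\t10\terica\tbuck\tThu Dec 25 16:29:28 GMT+00:00 1969\t10.5\tdog-food",
    ]
    for (state, product), total in sorted(count_by_state(sample).items()):
        print(state, product, total)
```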

What Next for EDP

Potential Areas for Development within EDP

• Pluggable Job Execution Model

    • Allows Sahara to run jobs with additional execution engines

    • Current Oozie offerings become one of multiple options

• Expand Capabilities via Oozie

    • Support upload of user-written Oozie workflows

    • Support for coordinated jobs

• Enhanced Usability

    • Better Error Reporting

    • User Experience (UI, CLI, API)

Please send us your feedback! Ideas are always welcome.

• #openstack-sahara on freenode

• openstack-dev@lists.openstack.org with [openstack-dev][sahara] subject

Design Summit Sessions

7 Sessions: Thursday 1:30 - Friday 10:30

http://goo.gl/lQXtUS

Q&A

Thank you!