Atlanta OpenStack Summit: Technical Deep Dive: Big Data Computations Using Elastic Data Processing...

Big Data Computations Using Elastic Data Processing in OpenStack Cloud

Sergey Lukjanov (Mirantis), Alexander Ignatov (Mirantis), Trevor McKay (Red Hat)

Agenda

• OpenStack Data Processing Overview

• EDP Architecture & Technical Concepts

• Live Demo

OpenStack Data Processing: Sahara

Mission: To provide a scalable data processing stack and associated management interfaces.

• provision and operate Hadoop clusters

• schedule and operate Hadoop jobs

Hadoop - Big Data Platform

Trends

http://www.google.com/trends/

Architecture overview

[Architecture diagram. Labeled components: Horizon, Savanna Pages, Savanna Python Client, Keystone, Cluster Configuration Manager, Resources Orchestration Manager, Job Manager, Job Sources, Data Sources, Data Access Layer, Vendors Plugins, Hadoop VMs, Glance, Cinder, Neutron, Trove DB.]

Sahara status

• Official integrated OpenStack project

• Supported Hadoop distros:
  • Vanilla Apache Hadoop
  • Hortonworks Data Platform
  • Intel Distribution
  • Cloudera Distribution in blueprint

• Included into OpenStack distros:
  • RDO - openstack.redhat.com
  • Mirantis OpenStack - software.mirantis.com

Contributors

Elastic Data Processing

• EDP - API for executing MapReduce jobs on Hadoop clusters (similar to AWS EMR)

• Supported data sources: Swift, HDFS, Ceph

• Supported job types: Java actions, MapReduce, MapReduce.Streaming, Pig, Hive

• Oozie for Hadoop jobs workflow management

• Supports both Hadoop 1 & 2

• Job executions on transient clusters
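As a rough illustration of what "an API for executing jobs" could look like from the client side, the sketch below builds a job-execution request body in Python. The field names (`cluster_id`, `input_id`, `output_id`, `job_configs`) and the short type label `Java` for "Java actions" are assumptions made for this example; they are not taken from the slides, and the helper `make_job_execution` is hypothetical.

```python
# Job types listed on the slide; "Java" stands in for "Java actions" (assumption).
JOB_TYPES = ("Java", "MapReduce", "MapReduce.Streaming", "Pig", "Hive")


def make_job_execution(job_type, cluster_id, input_id, output_id,
                       configs=None, args=None, params=None):
    """Build a hypothetical request body for launching an EDP job."""
    if job_type not in JOB_TYPES:
        raise ValueError("unsupported job type: %r" % job_type)
    return {
        "cluster_id": cluster_id,   # target Hadoop cluster
        "input_id": input_id,       # registered input data source
        "output_id": output_id,     # registered output data source
        "job_configs": {
            "configs": dict(configs or {}),  # Hadoop configuration values
            "args": list(args or []),        # positional arguments
            "params": dict(params or {}),    # named script parameters
        },
    }


request = make_job_execution("Pig", "cluster-1", "input-1", "output-1",
                             params={"STATE": "AK"})
```

The point of the shape is that the caller names a cluster, an input, an output, and a job, and does not interact with Hadoop directly.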

EDP Use Cases

• Simplified task executions. You don’t need to know Hadoop!

• Bursty workloads: ad-hoc queries that need significant resources only for a short period of time

• Utilization of free IaaS capacity for Hadoop tasks

EDP - Data Sources

[Diagram: Sahara EDP connects Swift and the Hadoop VM. Input is read from swift://some_container/INPUT and output is written to swift://some_container/OUTPUT.]
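A data source here is essentially a named URL plus, for Swift, the credentials needed to reach it. The following Python sketch shows one way such a record could be assembled and validated. The helper `make_data_source` and the field names (`name`, `type`, `url`, `credentials`) are assumptions for illustration, and only the `swift` and `hdfs` URL schemes are handled.

```python
def make_data_source(name, url, user=None, password=None):
    """Build a hypothetical data source record from a URL."""
    scheme = url.partition("://")[0]
    if scheme not in ("swift", "hdfs"):
        raise ValueError("unsupported data source URL: %r" % url)
    source = {"name": name, "type": scheme, "url": url}
    if scheme == "swift":
        # Swift objects are not readable anonymously, so credentials
        # travel with the data source definition.
        if not user or not password:
            raise ValueError("swift data sources need credentials")
        source["credentials"] = {"user": user, "password": password}
    return source


input_source = make_data_source(
    "demo-input", "swift://some_container/INPUT", "demo", "secret")
output_source = make_data_source(
    "demo-output", "swift://some_container/OUTPUT", "demo", "secret")
```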

EDP - Job Binaries

[Diagram: Sahara EDP retrieves job binaries from the Sahara DB (internal-db://script.pig) and from Swift (swift://some_container/mapreduce.jar).]

1. Pig, Hive scripts

2. Executable Jar files

3. Pluggable binaries and libraries
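The two URL forms on this slide indicate where a binary is stored: `internal-db://` for the Sahara database and `swift://` for object storage. A minimal Python sketch of dispatching on that scheme follows. The function name `locate_job_binary` and the returned labels are hypothetical.

```python
def locate_job_binary(url):
    """Split a job binary URL into (storage backend, location)."""
    scheme, sep, location = url.partition("://")
    if not sep or not location:
        raise ValueError("not a job binary URL: %r" % url)
    backends = {
        "internal-db": "sahara-db",  # stored in Sahara's own database
        "swift": "swift",            # stored as a Swift object
    }
    if scheme not in backends:
        raise ValueError("unknown job binary scheme: %r" % scheme)
    return backends[scheme], location


pig_script = locate_job_binary("internal-db://script.pig")
jar_file = locate_job_binary("swift://some_container/mapreduce.jar")
```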

EDP - Job Execution. Step 1

[Diagram, built up over several slides. Components: Sahara; DB: Jar, Pig; Swift: INPUT, Jar, Pig; Hadoop VM: JobTracker, Oozie.]

Oozie: Execute a job

workflow.xm

1. Job-specific configurations

2. URLs to binaries

3. URLs for data sources

4. Credentials

Data Processing

OUTPUT
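The four items carried by the workflow document (job-specific configurations, binary URLs, data source URLs, credentials) can be pictured with a small generator for an Oozie Pig action. This is a sketch only, not Sahara's actual generator: the function `build_pig_workflow`, the node names, and the `fs.swift.service.sahara.*` property names used for credentials are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

OOZIE_NS = "uri:oozie:workflow:0.2"


def build_pig_workflow(script, input_url, output_url, configs):
    """Return an Oozie workflow definition for one Pig action as a string."""
    app = ET.Element("workflow-app", {"xmlns": OOZIE_NS, "name": "edp-sketch"})
    ET.SubElement(app, "start", {"to": "job-node"})
    action = ET.SubElement(app, "action", {"name": "job-node"})
    pig = ET.SubElement(action, "pig")
    ET.SubElement(pig, "job-tracker").text = "${jobTracker}"
    ET.SubElement(pig, "name-node").text = "${nameNode}"
    # Items 1 and 4: job-specific configurations and credentials.
    conf = ET.SubElement(pig, "configuration")
    for name, value in sorted(configs.items()):
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Item 2: the binary to run (already uploaded next to the workflow).
    ET.SubElement(pig, "script").text = script
    # Item 3: data source URLs, passed to the script as parameters.
    ET.SubElement(pig, "param").text = "INPUT=" + input_url
    ET.SubElement(pig, "param").text = "OUTPUT=" + output_url
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Job failed"
    ET.SubElement(app, "end", {"name": "end"})
    return ET.tostring(app, encoding="unicode")


workflow = build_pig_workflow(
    "script.pig",
    "swift://some_container/INPUT",
    "swift://some_container/OUTPUT",
    {
        # Assumed property names for Swift credentials.
        "fs.swift.service.sahara.username": "demo",
        "fs.swift.service.sahara.password": "secret",
    },
)
```

Passing the data source URLs as script parameters keeps the Pig script itself independent of where the data lives.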

EDP BigPetStore Demo

BigPetStore is now part of Apache BigTop

• Test/demo laboratory for all things Hadoop

• Actively developed with integration testing

• Generates and processes data of arbitrary size

• git clone git://git.apache.org/bigtop.git

• Filed under bigtop/bigtop-bigpetstore

EDP BigPetStore Demo

What are we going to do?

• Generate 1M records of pet supply purchases

• Clean the data (“dirty CSV”)

• Extract cumulative counts by state

• Demonstrates Sahara EDP objects
  • Job Binaries
  • Jobs (Java and Pig)
  • Data Sources

EDP BigPetStore Sample Data

Generated Data (first job)

$ hadoop fs -cat bigpetstore/gen/part-r-00000 | more

BigPetStore,storeCode_AK,1 deanna,booker,Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food

BigPetStore,storeCode_AK,10 erica,buck,Thu Dec 25 16:29:28 GMT+00:00 1969,10.5,dog-food

Cleaned Data (second job)

$ hadoop fs -cat bigpetstore/clean/part-m-00000 | more

BigPetStore storeCode_AK 1 deanna booker Sun Jan 18 20:50:06 GMT+00:00 1970 7.5 cat-food

BigPetStore storeCode_AK 10 erica buck Thu Dec 25 16:29:28 GMT+00:00 1969 10.5 dog-food
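The cleaning step turns each "dirty CSV" record into uniformly delimited fields. The Python sketch below illustrates that transformation on the sample record. It is not the demo's actual job, and it assumes the raw record mixes commas with a tab after the record number and that the cleaned output is tab-separated (the whitespace in the listing above does not show which delimiter is used).

```python
import re


def clean_record(line):
    """Normalize a mixed comma/tab record into tab-separated fields."""
    fields = re.split(r"[,\t]", line.rstrip("\n"))
    return "\t".join(field.strip() for field in fields)


raw = ("BigPetStore,storeCode_AK,1\tdeanna,booker,"
       "Sun Jan 18 20:50:06 GMT+00:00 1970,7.5,cat-food")
cleaned = clean_record(raw)
```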

EDP BigPetStore Sample Data

Summed Data For Products by State (3rd job)

$ hadoop fs -cat bigpetstore/analyze_rel/part-r-00000 | more

US-AK cat-food 24837

US-AK dog-food 24994

US-AK fuzzy-collar 25145

US-AK antelope-caller 25024

US-AZ cat-food 25106

US-AZ dog-food 25064

US-AZ leather-collar 24870

US-AZ snake-bite ointment 24960
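The third job groups the cleaned records by state and product and counts them. A minimal in-memory Python sketch of that aggregation is shown here. It assumes tab-separated cleaned records with the store code in the second field and the product in the last field, and it assumes the state label is derived as `US-` plus the store code suffix, which is inferred from the sample output rather than stated in the slides.

```python
from collections import Counter


def count_by_state_and_product(cleaned_lines):
    """Count purchases per (state, product) pair."""
    counts = Counter()
    for line in cleaned_lines:
        fields = line.rstrip("\n").split("\t")
        store_code, product = fields[1], fields[-1]
        # storeCode_AK -> US-AK (assumed mapping)
        state = "US-" + store_code.split("_", 1)[1]
        counts[(state, product)] += 1
    return counts


sample = [
    "BigPetStore\tstoreCode_AK\t1\tdeanna\tbooker\tdate\t7.5\tcat-food",
    "BigPetStore\tstoreCode_AK\t10\terica\tbuck\tdate\t10.5\tdog-food",
    "BigPetStore\tstoreCode_AK\t11\tjane\tdoe\tdate\t7.5\tcat-food",
]
totals = count_by_state_and_product(sample)
```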

What Next for EDP

Potential Areas for Development within EDP

• Pluggable Job Execution Model
  • Allows Sahara to run jobs with additional execution engines
  • Current Oozie offerings become one of multiple options

• Expand Capabilities via Oozie
  • Support upload of user-written Oozie workflows
  • Support for coordinated jobs

• Enhanced Usability
  • Better Error Reporting
  • User Experience (UI, CLI, API)

Please send us your feedback! Ideas are always welcome.

• #openstack-sahara on freenode

• openstack-dev@lists.openstack.org with [openstack-dev][sahara] subject

Design Summit Sessions

7 Sessions: Thursday 1:30 - Friday 10:30

http://goo.gl/lQXtUS

Thank you!
