Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Hadoop in the Cloud: Real World Lessons from Enterprise Customers Hadoop Summit Dublin April 2016


Page 1: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Hadoop Summit Dublin, April 2016

Page 2: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Survey

Page 3: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Session Objectives and Takeaways

Session Objectives
• Enterprise customer case studies
• Understand the advantages of using Hadoop on cloud
• Discuss the common challenges of using Hadoop on the cloud

Key Takeaways
• Most Hadoop vendors and cloud providers have solution templates to help you tackle cloud migration challenges
• Pick a Hadoop distribution and cloud provider on overall strength of analytics portfolio

Page 4: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Outline
• Hadoop on Azure Offerings
• Enterprise Customer Case Studies
• Why Hadoop on Cloud?
• Challenges that customers face with Hadoop on cloud
• Q&A

Page 5: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Hadoop on Azure Offerings

Page 6: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Microsoft Hadoop Stack

Deployment options
• Hadoop distributions running in Azure VMs
• Azure HDInsight

Analytics
• Script: Pig
• SQL: Hive
• NoSQL: HBase
• Real-time: Storm
• Batch: MapReduce
• In-memory: Spark
• Machine learning: R Server

Storage
• Local (HDFS) or Cloud (Azure Blob / Azure Data Lake Store)

Page 7: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Azure HDInsight: Hadoop Meets the Cloud

• Microsoft’s managed Hadoop as a Service
• 100% open source Apache Hadoop
• Built on the latest releases across Hadoop (2.7)
• Up and running in minutes with no hardware to deploy
• Run on Windows or Linux
• Supported by Microsoft

Page 8: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Customer Case Studies

Page 9: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Rockwell Automation is partnered with one of the six oil and gas super majors to build unmanned internet-connected gas dispensers. Each dispenser emits real-time management metrics allowing them to detect anomalies and predict when proactive maintenance needs to occur.

• Store sensor data every 5 minutes: temperature, pressure, vibration, etc.
• Tens of thousands of data points per second

[Architecture diagram components: Data Factory, Azure Blobs, Azure HDInsight (Hive, Pig), Azure SQL DB, Power BI for O365, Mobile Notification Hub, Mobile Device. Label: Real-time notification]

Page 10: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

JustGiving wanted to harness the power of their data by using network science to map people’s connections and relationships so that they could connect people with the causes they care about. Based on 15 years of data, the JustGiving GiveGraph is the world’s largest ecosystem of giving behavior. It contains more than 81 million person nodes, thousands of causes and 285 million connections, and is the engine that drives JustGiving’s social platform, enabling levels of personalization and engagement that a traditional infrastructure would be unable to deliver.

[Architecture diagram components: SQL Server (on-premises), Agent, Azure Blobs, Azure HDInsight, GiveGraph, Azure Tables, Web API, Website + Event store, Service Bus, Azure Cache, Activity Feeds. Labels: Real-time event, Serves results]

Page 11: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

One of the leaders in the development and management of renewable energy infrastructure and services needed to understand data coming from their wind turbines/wind farms in an Internet of Things (IoT) scenario.

• 100s of wind farms across the globe
• Each wind farm has 100+ turbines
• Each turbine generates 10 data points every 25 milliseconds

Initial goal: Provide consumption-related analytics to their customers (power companies)

What else could they do with all that data? Predictive maintenance

How? Event Hub, Azure Storage, HDInsight, Azure SQL DB, Excel reporting
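To get a feel for the ingest volume those figures imply, here is a rough back-of-the-envelope calculation. It is only a sketch: the slide says "100s" of wind farms and "100+" turbines, so the values 100 and 100 below are assumed lower bounds, not numbers from the customer.

```python
# Back-of-the-envelope ingest rate for the wind-farm scenario.
# ASSUMPTION: "100s of wind farms" and "100+ turbines" are taken as 100 each,
# which is a lower bound chosen for illustration.
WINDFARMS = 100
TURBINES_PER_FARM = 100
POINTS_PER_SAMPLE = 10      # from the slide: 10 data points per sample
SAMPLE_INTERVAL_MS = 25     # from the slide: one sample every 25 ms


def points_per_second(farms, turbines, points, interval_ms):
    """Data points emitted per second by the given number of turbines."""
    samples_per_second = 1000 / interval_ms
    return farms * turbines * points * samples_per_second


per_turbine = points_per_second(1, 1, POINTS_PER_SAMPLE, SAMPLE_INTERVAL_MS)
fleet = points_per_second(
    WINDFARMS, TURBINES_PER_FARM, POINTS_PER_SAMPLE, SAMPLE_INTERVAL_MS
)

print(f"per turbine: {per_turbine:.0f} points/s")
print(f"whole fleet: {fleet:,.0f} points/s")
```

Under these assumptions a single turbine emits 400 data points per second and the fleet emits at least 4 million per second, which is the kind of volume that motivates an event-ingestion service in front of storage.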

Page 12: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Why Hadoop on Cloud?

Page 13: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Why Hadoop on Cloud?
• Cost savings
• Agility
• Elasticity
• Integration with other Cloud Services
• Choice of Deployment Models

Page 14: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Cost Savings
• No hardware licenses or service-specific support agreements
• Pay only for what you use, when you need it, not more than you need
• Independently scale storage and compute
• No need to hire a specialized operations team to do big data
• 63% lower total cost of ownership than on-premises*

*Pending IDC study found that, on a per-TB basis, Microsoft customers using cloud-based Hadoop in Data Lake have a 63% lower TCO than on-premises

Page 15: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Agility
• Up and running in minutes: a Hadoop cluster on the cloud can be up and running in minutes
• No cluster management needed: all bits and services are automatically deployed by Azure HDInsight
• Enterprise-level support: fully supported by Microsoft and Hortonworks

Page 16: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Elasticity

Scale up
• HDInsight offers 11 VM instance types
• Better VM instance = more parallelism and/or more CPU/memory

Scale out
• Choose a custom number of instances
• More worker nodes = more parallelism

Page 17: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Integration with other cloud services

Cortana Analytics Suite
• Use the rich analytical services in Azure to build your entire pipeline

Page 18: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Cloud Deployment Models

Why use Cluster as a Service?
• Pay only for the time the cluster was actually used
• Since both data and metadata are persisted, the experience is as if the cluster was never deleted

                                              Always-on cluster                               Cluster as a service
Storage choice                                Local HDFS, Azure Blob, Azure Data Lake Store   Azure Blob, Azure Data Lake Store
Job scheduling                                Oozie                                           Azure Data Factory
Data persistence after cluster deletion       N/A                                             Azure Blob, Azure Data Lake Store
Metadata persistence after cluster deletion   N/A                                             Azure SQL
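The "pay only for the time the cluster was actually used" point can be illustrated with a small calculation. This is a sketch under stated assumptions: the hourly rate and the job schedule (one 3-hour job per day) are hypothetical values chosen for illustration, and the comparison covers compute hours only. Storage and the metadata database keep costing money after the cluster is deleted.

```python
# Compare billed cluster-hours for an always-on cluster versus a cluster
# that is created per job and deleted afterwards (cluster as a service).
# ASSUMPTION: HYPOTHETICAL_RATE and the job schedule are made-up inputs.
HOURS_PER_DAY = 24
HYPOTHETICAL_RATE = 1.0  # cost units per cluster-hour, not a real price


def always_on_hours(days):
    """Cluster-hours billed when the cluster runs around the clock."""
    return days * HOURS_PER_DAY


def on_demand_hours(days, jobs_per_day, hours_per_job):
    """Cluster-hours billed when the cluster exists only while jobs run."""
    return days * jobs_per_day * hours_per_job


days = 30
always_on_cost = always_on_hours(days) * HYPOTHETICAL_RATE
on_demand_cost = (
    on_demand_hours(days, jobs_per_day=1, hours_per_job=3) * HYPOTHETICAL_RATE
)

print(f"always on: {always_on_cost:.0f} units")
print(f"on demand: {on_demand_cost:.0f} units")
print(f"on demand is {on_demand_cost / always_on_cost:.1%} of always on")
```

The gap narrows as utilization rises, so an always-on cluster remains reasonable for workloads that keep the cluster busy most of the day.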

Page 19: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Common Challenges and Solutions for Hadoop on Cloud

Page 20: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Common challenges with Hadoop on Cloud
• Scaling cloud storage for big workloads
• Data and Metadata Migration from On-prem to Cloud
• Extending Hadoop to third party apps
• Security and Compliance
• Integration with Enterprise tools

Page 21: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Scaling cloud storage for big workloads

Partitioning
• Partitioned data on Year, Month, Day

Problem
• Simultaneous reads and writes caused an I/O bottleneck

[Diagram: Partitions 1 to 3 (2014-10, 2014-11, 2014-12), each with files part0 to part3, all stored in a single Traditional Cloud Store]

Page 22: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Scaling cloud storage for big workloads

Partitioning per Account
• Put each partition in its own account

Problem
• Due to partition pruning, each query will still go to the same account, still causing throughput bottlenecks

[Diagram: Partitions 1 to 3 (2014-10, 2014-11, 2014-12), each with files part0 to part3, with each partition stored in its own Traditional Cloud Store (1, 2, 3)]

Page 23: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Scaling cloud storage for big workloads

Solution
• Keep files of each partition across multiple storage accounts
• Encode knowledge of physical location into logical partitioning key

[Diagram: files part0 to part3 of 2014-10, 2014-11 and 2014-12, organized into Partitions 1 to 4 and distributed across Traditional Cloud Stores 1 to 4]
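One way to picture the solution is a function that derives the storage account from part of the file's logical key, so that the files of a single month land in different accounts. The sketch below is an assumption-laden illustration, not the customer's implementation: the account names, the container name and the round-robin rule are hypothetical. Only the `wasb://container@account.blob.core.windows.net/path` URI shape comes from Hadoop's Azure Blob Storage connector.

```python
# Spread the files of each date partition across several storage accounts
# by encoding the physical location (account) into the file's logical key.
# ASSUMPTION: account names, container name and the modulo rule are
# hypothetical choices made for illustration.
STORAGE_ACCOUNTS = ["store1", "store2", "store3", "store4"]
CONTAINER = "data"


def account_for(part_index, accounts=STORAGE_ACCOUNTS):
    """Pick a storage account for a file based on its part index."""
    return accounts[part_index % len(accounts)]


def file_uri(year, month, part_index):
    """Build the URI of one file of the (year, month) partition."""
    account = account_for(part_index)
    name = f"{year:04d}-{month:02d}.part{part_index}"
    return f"wasb://{CONTAINER}@{account}.blob.core.windows.net/{name}"


def uris_for_month(year, month, num_parts):
    """All file URIs a query pruned to one month would have to read."""
    return [file_uri(year, month, i) for i in range(num_parts)]


for uri in uris_for_month(2014, 10, 4):
    print(uri)
```

With this mapping, a query that partition pruning narrows to October 2014 still reads from four different accounts, so no single account's throughput limit becomes the bottleneck.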

Page 24: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Azure Data Lake Store: Improving cloud store limits
• No limits on file sizes
• Analytics scale on demand
• No code rewrites as you increase size of data stored
• Optimized for massive throughput
• Optimized for IoT with high volume of small writes

Page 25: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Hybrid model: Data and Metadata synchronized

Goals
• How to have minimal downtime while migrating a cluster to the cloud?
• How to move both data and metadata?
• How to set up mirroring, i.e. constant replication?

Page 26: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Hybrid model: Data and Metadata synchronized

Data Synchronization
• Hortonworks and Microsoft together released Falcon with an Azure Data Factory connector
• Allows constant replication of data between on-prem and cloud

Metadata Synchronization
• For true cluster replication, metadata also needs to be replicated in addition to data
• You can configure the on-prem cluster to use SQL Server and use the AlwaysOn Availability Groups feature to replicate metadata between on-prem and cloud

Page 27: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Extending Hadoop to your on-prem resources

Use Azure VNet feature to extend HDInsight to your on-prem network

Page 28: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Hadoop Extensibility: Installing own applications

Linux Workloads
• Traditionally, HDInsight ran on Windows, but with Linux customers can run more open source applications

ScriptAction
• You can create custom Bash scripts that can be provided during cluster creation, or applied to an already running cluster, to install other applications

VNet and Edge Nodes
• An edge node can be created in an HDInsight cluster within a VNet to run more applications

Page 29: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Using ISV solutions

Scenario
• Hadoop has a rich ecosystem of apps
• Customers want to use apps beyond those provided out of the box

Why use ISV applications?
• Provide more features than those available in Hadoop
• WYSIWYG Query Designer Tools
• OLAP BI Capabilities over Hadoop cluster
• Fine-grained access control
• Drag and Drop data pipeline design and orchestration

Page 30: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

ISV apps: Datameer

Datameer
• WYSIWYG Query Designer in an Excel-like Interface
• Schedule recurring jobs
• Easily share projects with other analysts/data engineers in your company

Page 31: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

ISV apps: AtScale

AtScale is an OLAP engine purpose-built for Hadoop. It leverages the latest advancements in the Hadoop ecosystem to support existing BI workloads.
• Multiple SQL-on-Hadoop Engine Support
• Access Data Where it Lays
• Built-in Support for Complex Data Types
• Single Drop-in Gateway Node Deployment

Page 32: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

ISV apps: Cask

Cask
• Build pipelines using Drag & Drop
• Source connections from on-prem relational databases, or cloud stores for big data, into HDInsight/Data Lake Storage
• Common data pipeline task library
• Free, open source license to get started, enterprise option for dedicated use

Page 33: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Azure Security: Encryption At Rest

Azure Blob Storage (In Preview)
• Encryption at rest using Microsoft managed keys
• Customers can use Azure Storage configuration to manage encryption. No HDInsight changes required.

Page 34: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

RBAC: Securing HDInsight with BlueTalon (ISV)

• Multi-user access and fine-grained authorization policies for Hive tables
• Row and column level security, data masking, etc.
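To make row-level security, column-level security and data masking concrete, the following is a minimal conceptual sketch in plain Python. It does not use or represent the BlueTalon product or its API. The policy structure, role name, column names and sample rows are all hypothetical.

```python
# Conceptual illustration of fine-grained authorization on tabular data:
# a row filter, a column allow-list, and masking of sensitive columns.
# ASSUMPTION: POLICIES, the "analyst" role and all column names are
# hypothetical; this is not any vendor's policy format.
def mask(value, visible=4):
    """Replace all but the last `visible` characters with asterisks."""
    text = str(value)
    tail = text[-visible:] if visible > 0 else ""
    return "*" * max(len(text) - visible, 0) + tail


POLICIES = {
    "analyst": {
        "row_filter": lambda row: row["region"] == "EU",    # row-level rule
        "columns": ["donor_id", "region", "card_number"],   # column-level rule
        "masked": {"card_number"},                          # masking rule
    },
}


def apply_policy(rows, role):
    """Return only the rows and columns the role may see, with masking."""
    policy = POLICIES[role]
    result = []
    for row in rows:
        if not policy["row_filter"](row):
            continue
        visible = {}
        for col in policy["columns"]:
            value = row[col]
            visible[col] = mask(value) if col in policy["masked"] else value
        result.append(visible)
    return result


rows = [
    {"donor_id": 1, "region": "EU", "card_number": "4111111111111111", "amount": 10},
    {"donor_id": 2, "region": "US", "card_number": "5500000000000004", "amount": 25},
]
print(apply_policy(rows, "analyst"))
```

In a real deployment these rules are enforced by the authorization layer in front of Hive rather than in application code. The sketch only shows what each of the three kinds of rule does to a result set.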

Page 35: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Integration with Enterprise Tools
• Customers want a variety of tools for their end users
• HDInsight provides query authoring with Hue and Ambari Views
• Supports Jupyter out of the box and Zeppelin with ScriptAction
• Query authoring support using Visual Studio
• First class Scala/Java support for Spark apps using IntelliJ

Page 36: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Be productive with a robust development environment

Deep integration with Visual Studio
• Easy for novices to write simple queries
• Robust environment for experts to also be productive
• Integrated with Pig, Hive, and Storm
• Playback that visualizes performance to identify bottlenecks and areas for optimization

Productive for novices and experts

Page 37: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Microsoft Makes Hadoop Easier

Deep Visual Studio Integration
• Debug Hive jobs through YARN logs or troubleshoot Storm topologies
• Visualize Hadoop clusters, tables, and storage
• Submit Hive queries, Storm topologies (C# or Java spouts/bolts)
• IntelliSense for authoring Hive jobs and Storm business logic

Page 38: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Great authoring experience: full IntelliSense support (the tool can also fetch remote metadata for suggestions, so users don’t need to remember a lot of DB/table names)

Integrated with the Visual Studio project system, so users can easily use version control there

Page 39: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Show the DAG graphs for Hive on Tez jobs (with more details in the tooltip)

Show associated query

Page 40: Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Page 41: Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Page 42: Hadoop in the Cloud: Real World Lessons from Enterprise Customers
Page 43: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Session Objectives and Takeaways

Session Objectives
• Understand the advantages of using Hadoop on cloud
• Discuss the common problems of using Hadoop on the cloud

Key Takeaways
• Most Hadoop vendors and cloud providers have solution templates to help you tackle cloud migration challenges
• Pick a Hadoop distribution and cloud provider on overall strength of analytics portfolio

Page 44: Hadoop in the Cloud: Real World Lessons from Enterprise Customers

Q&A