A New "Sparkitecture" for modernizing your data warehouse


Transcript of A New "Sparkitecture" for Modernizing Your Data Warehouse


A New Sparkitecture for Modernizing Your Data Warehouse
Ranga Nathan (Big Data Solutions Product Management)
Jack Gudenkauf (Big Data Professional Services Architect)
June 9, 2016



40% of enterprises struggle to identify, integrate, and manage Big Data with existing technology

Barriers:
More systems to manage
More complexity to integrate
More data to identify

It's fair to say that most companies recognize the need and have the desire to be data-driven, but execution is challenging. A Wikibon survey in 2015 revealed that enterprises have difficulty integrating Big Data with existing infrastructure and struggle to identify the technologies to use. It's the irony of having so much data but struggling to make the most of it. So why?

The first problem is silos and lack of alignment. Organizations are drowning in a sea of data, but they don't know what to do with their data to get value, and they have siloed, uncoordinated efforts and science experiments across the organization.

The second problem is the technology gap. Traditional systems, architectures, and approaches were never designed for today's data. They fall short on the ability to handle the scale, speed, or variety of data and to deliver insights fast enough to meet business needs, or they are simply too cost-prohibitive to deploy and manage.

The third problem is the inability to bridge data to value. Just having the right technology tools in place doesn't mean you'll get the value of the data. Today there's an operational chasm between the people and tools on one side and the value on the other.

So you want to get more out of your existing investments and build for the future, but how?


Hadoop and the data lake

Hadoop Data Lake (shared foundation for multiple apps)

Vision:
Data-centric foundation for all data and apps
Elastic data management & compute platform for all data
Single platform for all analytical workloads

Reality:
Data swamps due to lack of oversight and data governance
Dearth of skilled resources to extract value from the data
Sub-optimal performance with traditional architectures
Cannot scale to handle multi-tenant workload complexity

Data Ponds

Hadoop, created in 2006, was designed to support large-scale, distributed data management, storage, and processing for all data, structured and unstructured. It has proven to be cost-effective at doing that, effectively solving the challenge of managing all the growing data across an organization.

Vision of the Hadoop data lake: all workloads, analytics, and applications running on a common data set; a flexible, elastic platform for all data and analytical workloads; no moving data to different databases for different workloads. Reality of the Hadoop data lake: for most organizations, reality has fallen short of the vision. Instead of a data lake, you have data ponds: silos across business groups, silo-cluster sprawl across Hadoop environments (one cluster for MapReduce, one for SQL, one for Spark, and so on), limited workloads, and ultimately limited value.

While many have been using Hadoop as a data repository for simple workloads like ETL and pre-processing data, and have been approaching analytics as an afterthought, we often hear customers say, "I wish I could analyze and interact with my data in Hadoop at enterprise-grade performance and reliability, and identify the use cases that would create the most value for my organization."

Business needs:
* Get more value out of Hadoop; uncover more insights from the data
* Run more workloads
* Consolidate data and infrastructure
* Scale Hadoop across a common, elastic, shared infrastructure


Conventional Wisdom Regarding Deploying a Data Lake Infrastructure

Use cases by platform:

ProLiant DL380 (Traditional): Traditional Hadoop architecture; batch workloads with predictable growth.

Apollo 4530 (Density-Optimized): Lowers Big Data costs for larger deployments; match compute to workload.

Apollo 4200 (Density-Optimized): Large internal storage; ideal for large data volumes and batch workloads where density or cost per GB is key.

Data Lakes & Hubs: Ingestion of multiple types and sources of data; aggregation, transformation, and visualization; batch, interactive, and real-time workloads.

Data Warehouse Modernization: Data staging and landing zone; migration of operational data stores; active archiving; batch workloads.

A Big Data Journey

ETL Offload

Archival

Deep Learning

Event Processing

In Memory Analytics


HPE Elastic Platform for Analytics: Flexible Convergence for Big Data Workloads

Low Latency Compute (Event Processing): Moonshot m710p
Big Memory Compute (In-Memory Analytics): Apollo XL170r with 512 GB memory
Archival Storage: Apollo 4200 with 6 TB HDDs
High Latency Compute (ETL Offload and Archival): Apollo XL170 with 256 GB memory
HPC Compute (Deep Learning): Apollo XL190r with GPUs
HDFS Storage: Apollo 4200 with 3 TB HDDs


Unique Value: HPE Workload- and Density-Optimized (WDO) Solution / HPE Elastic Platform for Analytics
Innovation delivering unique value to customers and the open source community:
Data Consolidation: shared storage pool for multiple Big Data environments
Maximum Elasticity: dynamic cluster provisioning from compute pools without repartitioning data
Flexible Scalability: scale compute and storage independently
Breakthrough Economics: workload-optimized components for better density, cost, and power

Diagram: HPE Moonshot or HPE Apollo compute nodes connected over Ethernet to HPE Apollo 4xx0 storage nodes.


Key Takeaway: The HP Big Data Reference Architecture is another innovation from HP that leverages the strength of HP's portfolio to deliver value to our customers through a differentiated solution combining HP Moonshot servers and HP Apollo storage servers.

The value proposition and the how:

Traditional scale-up infrastructures separate compute and storage for the flexibility of scaling them independently, but at the cost of management complexity and expense. Scale-out architectures, and newer technologies that use DAS storage within a server, lose that ability to scale independently by combining compute and storage in one box, a tradeoff made to achieve hyper-scalability and simple management.

The HP Big Data Reference Architecture deploys a standard Hadoop distribution in an asymmetric fashion: storage-related components such as the Hadoop Distributed File System (HDFS) and HBase (the open source non-relational distributed database) run on Apollo density-optimized servers, while compute-related components run under YARN on Moonshot hyperscale servers. This provides the best of both worlds: the ability to scale compute and storage independently without losing the benefits of scale-out infrastructure.

To make this more flexible, HP worked with Hortonworks to create a new feature in Hadoop called YARN labels, an innovation we contributed to open source. YARN labels let us create pools of compute nodes where applications run, so it is possible to dynamically provision clusters without repartitioning data (since data can be shared across compute nodes). We can scale compute and storage independently by simply adding compute nodes or storage nodes to scale performance linearly. This fundamentally changes the economics of the solution across scale, performance, and cost efficiency to meet specific use case and workload needs.

Building blocks for the HPE Elastic Platform for Analytics

HP Apollo 2000 System: A density-optimized compute platform that offers double the density of traditional 1U servers and high memory/core ratios. Workload-optimized compute nodes for Spark, Hive/Tez, MapReduce, YARN, Vertica SQL on Hadoop, and other analytics and batch workloads.

HP Apollo 4200 Scalable System: A cost-effective, industry-standard storage server purpose-built for big data with converged infrastructure that offers high-density, energy-efficient storage. Workload-optimized storage nodes for HDFS, Kudu, and building a multi-temperature data lake environment; ideal for EDW offload workloads and as a foundation storage block for workload consolidation.

HP Moonshot System: A complete server system engineered for specific workloads and delivered in a dense, energy-efficient package. Workload-optimized compute nodes for HBase, Kafka, and other low-latency, streaming workloads; ideal for the highest density and lowest power requirements.

HP ProLiant DL300 System: The industry's most popular server, balancing the latest compute and memory technologies with internal storage and flash options, coupled with industry-leading management and serviceability. Balanced worker node for batch and single-function use cases; scalable storage node, with a smaller fault domain than the Apollo 4200, for workload consolidation use cases.

Yes, But Does It Perform?

Benchmark: MapReduce throughput over HDFS, comparing the hyperscale configuration (Moonshot compute nodes, 45 m710 cartridges, plus 4 storage nodes over Ethernet without RoCE) against a conventional Hadoop cluster of 4 worker nodes: reads of 9.2 GB/sec versus 4.9 GB/sec, and writes of 7.4 GB/sec versus 3.4 GB/sec.

Comparing configurations for a single-subject data mart use case, normalized on CPU (SpecInt) and list price:
Balanced: 18 worker nodes (symmetric), ProLiant DL380 Gen9
BDO: 6 worker node blocks (18 nodes), Apollo 4530; same performance (SpecInt), 7% higher $/SpecInt
WDO: 16 compute nodes + 4 storage nodes, Apollo 2000/4200; 6% better performance (SpecInt), 2% lower $/SpecInt

Hyperscale Price/Performance Compared with a Conventional Cluster

Normalized on list price and storage capacity: ProLiant DL380 (conventional) versus Apollo 2000 + Apollo 4200 (hyperscale).

Independent scaling of compute and storage, hyperscale vs. conventional scale-out*:
Hot data: 2.8x the compute, 97% of the storage capacity, 4x the memory
Balanced: 1.6x the compute, 1.5x the storage capacity, 2.5x the memory
Cold data: 0.9x the compute, 2.1x the storage capacity, 1.5x the memory


* Compared with a balanced, conventional full-rack cluster

Hyperscale benefits for Big Data: the Hadoop labels feature (JIRA YARN-796)
Contributed node label concepts into Hadoop (Apache 2.6 trunk)
Allows scheduling of YARN containers to specific pools of nodes
Combined with the hyperscale approach, compute nodes can be dynamically assigned because no data needs to be repartitioned

Diagram: Hadoop Cluster 1 re-sliced over shared storage nodes: data prep with Hive/Tez (12am to 6am), then predictive analytics with Spark and Vertica SQL on Hadoop (6am to 12am) on a high-power compute pool.

Now, the HP engineering team behind this architecture got inside of Hadoop and started playing around, figuring out how we could really teach it to exploit this architecture. We came up with the concept of creating groups of nodes with similar characteristics (big memory, GPUs, slower processors) that we called herds, and spoke with the distro vendors about the concept. Hortonworks said, "We're working on a feature called labels which will accomplish the same thing," so we worked with them to contribute our code and specifications into the Hadoop trunk (v2.6) so that everybody has it (in Hortonworks 2.2 now; forthcoming in Cloudera 5.4).

In our architecture, think about how powerful labels are. I can give a group of compute nodes a label and say, "This is where I want to run SAS," and I'll run Vertica SQL on Hadoop over here, Spark, HBase, MapReduce, and so on. I can slice up the cluster and, on a moment's notice, re-slice it, because I don't have to repartition any data to do that. I can just add new nodes, or reshape my current node pool, to meet whatever workload demands I have. Many organizations will, during the evening hours, prep, aggregate, and roll up their data for use the coming day; with this approach, I can allocate all of my nodes to these types of batch jobs during the maintenance period, and then reallocate them to analytic, ingest, and operational tasks during the business day.
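To make that concrete, here is a minimal sketch (not code from this talk) of how a Spark batch job could be pinned to a labeled compute pool. The "spark-pool" label name, application name, and HDFS path are illustrative assumptions, and the settings only take effect once an administrator has defined and assigned the YARN node labels on the cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NightlyRollups {
  def main(args: Array[String]): Unit = {
    // Ask YARN to place this job's application master and executors only on
    // nodes carrying the (hypothetical) "spark-pool" label created with the
    // YARN-796 node labels feature; data stays put on the shared HDFS storage nodes.
    val conf = new SparkConf()
      .setAppName("NightlyRollups")
      .set("spark.yarn.am.nodeLabelExpression", "spark-pool")
      .set("spark.yarn.executor.nodeLabelExpression", "spark-pool")

    val sc = new SparkContext(conf)
    // Example batch work against the shared data lake (path is illustrative).
    val lineCount = sc.textFile("hdfs:///data/raw/events/*").count()
    println(s"Rolled up $lineCount raw events")
    sc.stop()
  }
}
```

Re-slicing the cluster then amounts to changing which nodes carry which labels, with no data movement at all.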

If you just blurred your eyes a little bit and said the storage nodes look like an array, the network was a SAN, the compute nodes were blades, and the containers were VMware, this looks an awful lot like a converged system, right? We've just reinterpreted the converged system model for the new style of IT.

What's cool is that we gave this to everybody. At HP, when you talk about giving things to open source, people get the heebie-jeebies. It's actually good; in our best-case scenario, our competitors will have this kind of architecture. We want everyone to endorse it.


A Modern Sparkitecture for Real-time Analytics (SMACK)

Diagram: Spark (Spark SQL, batch, ML, streaming, graph) at the center of the stack, connected to Cassandra, Mongo, HDFS, Kafka, ORC, Parquet, Oracle, Vertica, and SAP HANA Vora.

Diagram: Dual ingestion pipeline. A mirrored Kafka cluster feeds both a Vertica MPP columnar data warehouse (Vertica clusters with multi-DC disaster recovery, loaded via the Vertica Kafka connector and the Vertica-Spark integration) and a non-relational Hadoop cluster running Hive (Hortonworks, Cloudera, or AWS), where semi-structured data lands as Avro under HDFS/AVRO/Topic/YYYY/MM/DD/HH and structured data lands as ORC under HDFS/ORC/Topic/YYYY/MM/DD.
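As a small illustration of that landing-zone layout (a sketch under assumed paths and packages, not the production loader), a Spark 1.x batch job could convert one day's Avro drop for a topic into the date-partitioned ORC path:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OrcLandingZoneWriter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("OrcLandingZoneWriter"))
    val hiveContext = new HiveContext(sc) // ORC support in Spark 1.x lives here

    val topic = "web_events"              // hypothetical Kafka topic name
    val (yyyy, mm, dd) = ("2016", "06", "09")

    // Read the hour-partitioned Avro drops for the day (spark-avro package assumed)...
    val events = hiveContext.read.format("com.databricks.spark.avro")
      .load(s"/HDFS/AVRO/$topic/$yyyy/$mm/$dd/*")

    // ...and persist the structured copy to the ORC landing path for Hive and Vertica.
    events.write.mode("append").format("orc")
      .save(s"/HDFS/ORC/$topic/$yyyy/$mm/$dd")

    sc.stop()
  }
}
```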


Diagram: Parallel Streaming Transformation Loader. Web, mobile, and IoT clients and MySQL applications send JSON/Avro messages through the Confluent REST Proxy and Kafka Connect into a centralized data hub (Kafka cluster) at the data center boundary. A Hadoop Spark cluster runs Spark parallel streaming transformations in near real time, transforming and reshaping the data to map it to the Vertica source of truth (SOT) in Scala, Java, Python, or SQL, and the SparkVerticaLoader loads the results into Vertica.

This is a holistic system: ingest must take the loader into consideration, and the loader must run on scalable clusters with a separation of concerns when needed, based on workload and use cases (CPU, memory, and storage together define capacity).
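As a rough sketch of such a parallel streaming transformation loader (an illustration with assumed topic, table, and connection names, not the actual SparkVerticaLoader, which per the diagram uses the Vertica Spark and Kafka connectors rather than plain JDBC): consume messages from the Kafka hub with Spark Streaming, reshape each micro-batch, and append it to Vertica.

```scala
import java.util.Properties

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ParallelStreamingLoader {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ParallelStreamingLoader")
    val ssc = new StreamingContext(sparkConf, Seconds(30)) // near-real-time micro-batches

    // Direct (receiver-less) Kafka stream; broker list and topic are assumptions.
    val kafkaParams = Map("metadata.broker.list" -> "kafka1:9092,kafka2:9092")
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("web_events"))

    messages.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        // Reshape the raw JSON payloads into a DataFrame matching the Vertica
        // source-of-truth table (this is where the real transform/mapping goes).
        val events = sqlContext.read.json(rdd.map { case (_, json) => json })

        val props = new Properties()
        props.setProperty("user", "dbadmin")      // illustrative credentials
        props.setProperty("password", "changeme")
        props.setProperty("driver", "com.vertica.jdbc.Driver")
        events.write.mode("append")
          .jdbc("jdbc:vertica://vertica-host:5433/analytics", "public.web_events", props)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

In practice the micro-batch interval, parallelism, and connector choice would be tuned per workload, which is exactly the CPU, memory, and storage capacity point above.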

Infrastructure trends affecting Big Data architecture:
Workload optimization: low-power SoCs and other accelerators are giving rise to workload-optimized servers
Faster network fabric: dramatic increase in fabric speeds
Multi-temperature storage: enterprise adoption of tiered storage accelerated by NVMe, flash, and similar media
Container-based apps: running multiple containerized apps while hosting a common resource manager (YARN)


Application optimized: Engineered, tested, and integrated for workload-specific performance. Choose cartridges for web serving, hosted desktops, video transcoding, application delivery, real-time data processing, and more.

Highly flexible fabrics: A highly flexible chassis fabric with significant bandwidth to move multiple terabits of data per second. Low-latency throughput for fast intra-cartridge communications and external network connectivity to each server.

System-on-a-chip: An energy-efficient system-on-a-chip (SoC) design and shared infrastructure across 45 hot-pluggable server cartridges in a 4.3U chassis. Enables low power consumption, ultra-high density, and flexible scale-out with significantly less cabling.

Dense form factor: Faster innovation and unprecedented scale begin with the Moonshot chassis. This versatile chassis simplifies management, with four iLO modules that share management responsibility for the 45 servers, power, cooling, network uplinks, and switches. Its 4.3U form factor allows for 10 chassis per rack; with a quad-core cartridge, that amounts to 1,800 servers in a single rack.

Elastic Platform for Analytics long-term view: evolve to support multiple compute and storage blocks

Multi-temperature, density-optimized storage using HDFS tiering, NoSQL stores, and object stores: low-cost nodes, SSD nodes, disk nodes, and archive nodes.

Density- and workload-optimized compute nodes to accelerate various big data software: GPU nodes, FPGA nodes, and big-memory nodes.

Delivering a Scalable Data Lake for the Enterprise
Unlock the most value and performance from Hadoop
Scale without compromising data security, reliability, and ROI
An enterprise-grade, trusted, and proven HPE solution

Optimize the Hadoop Data Lake for More Business Value

Elastic Platform for Analytics (Workload and Density Optimized)
High-Performing Analytics Engines for Hadoop
Data Security for Hadoop
Consulting & Implementation Services for Hadoop

HPE recognized the challenges and limitations of Hadoop and has developed a solution that addresses them through a robust yet flexible offering. It enables organizations to maximize the performance, infrastructure, and analytics wherewithal of Hadoop, to place more trust in putting their data there, and to implement a future-proof, data-centric foundation that scales with evolving business needs.

What matters?
Infrastructure: needs to be flexible, optimized, and future-proof to support the increasingly complex workloads running on Hadoop.
Analytics: robust analytics engines that deliver performance and stability for business-grade SLAs.
Security: data encryption and tokenization for sensitive and regulated data, protecting data at rest, in motion, and in use.
Open: a partner solution that supports all the major Hadoop distributions (Hortonworks, Cloudera, MapR) using open architectures rather than proprietary offerings.

HPE's comprehensive solution includes flexible, optimized infrastructure (for both standard symmetric and asymmetric architectures), high-performing analytics engines (a best-in-class SQL-on-Hadoop engine plus an unstructured human-data platform), data encryption security for Hadoop, and consulting and implementation services.


Thank you
