A New "Sparkitecture" for modernizing your data warehouse
DataWorks Summit / Hadoop Summit
Transcript of A New "Sparkitecture" for modernizing your data warehouse
A New Sparkitecture for Modernizing Your Data Warehouse
Ranga Nathan (Big Data Solutions Product Management)
Jack Gudenkauf (Big Data Professional Services Architect)
June 9, 2016
40% of enterprises struggle to identify, integrate, and manage Big Data with existing technology
More systems to manage
More complexity to integrate
More data to identify
Barriers:
It's fair to say that most companies recognize the need and have the desire to be data-driven, but execution is challenging. Wikibon's 2015 survey revealed that enterprises have difficulty integrating Big Data with existing infrastructure, and struggle to identify the technologies to use. It's the irony of having so much data but struggling to make the most of it. So why?
The 1st problem is silos and lack of alignment. Organizations are drowning in a sea of data, but they don't know what to do with it to get value, and have siloed, uncoordinated efforts and science experiments across the organization.
The 2nd problem is the technology gap. Traditional systems, architectures, and approaches were never designed for today's data, falling short on the ability to handle the scale, the speed, or the variety of data and to deliver insights fast enough to meet business needs, or they are just too cost-prohibitive to deploy and manage.
The 3rd problem is the inability to bridge data to value. Just having the right technology tools in place doesn't mean you'll get the value of the data. Today there's an operational chasm between the people and tools and the value they are meant to deliver.
So you want to get more out of your existing investments and build for the future, but how?
Hadoop and the data lake
Hadoop Data Lake
Vision:
Data-centric foundation for all data and apps
Elastic data management & compute platform for all data
Single platform for all analytical workloads
Reality:
Data swamps due to lack of oversight and data governance
Dearth of skilled resources to extract value from the data
Sub-optimal performance with traditional architectures
Cannot scale to handle multi-tenant workload complexity
Data Ponds
Hadoop, created in 2006, was designed to support large-scale, distributed data management, storage and processing for all data, structured and unstructured. It's proven to be cost-effective at doing that, effectively solving the challenge of managing all the growing data across an organization.
Vision of the Hadoop data lake: all workloads/analytics/applications running on a common data set; a flexible, elastic platform for all data and analytical workloads; don't move data to different DBs for different workloads.
Reality of the Hadoop data lake: reality has fallen short of the vision for most organizations. Instead of a data lake, you have data ponds: silos across business groups, silo-cluster sprawl across your Hadoop environments (one cluster for MapReduce, one for SQL, one for Spark, etc.), limited workloads and ultimately limited value.
While many have been using Hadoop as a data repository for simple workloads like ETL and pre-processing data, and have been approaching analytics as an afterthought, we often hear from customers saying, "I wish I could analyze and interact with my data in Hadoop at enterprise-grade performance and reliability, and identify the use cases that would create the most value for my organization."
Business needs:
- get more value out of Hadoop, uncover more insights from the data
- run more workloads
- consolidate data and infrastructure
- scale Hadoop across a common, elastic, shared infrastructure
Conventional Wisdom Regarding Deploying a Data Lake Infrastructure
Use Cases:
Data Lakes & Hubs: ingestion of multiple types and sources of data; aggregation, transformation and visualization; batch, interactive, real-time workloads
Data Warehouse Modernization: data staging & landing zone; migration of operational data stores; active archiving; batch workloads
Platforms:
ProLiant DL380 (Traditional): traditional Hadoop architecture; batch workloads with predictable growth
Apollo 4530 (Density-Optimized): lowers Big Data costs for larger deployments; match compute to workload
Apollo 4200 (Density-Optimized): large internal storage; ideal for large data volumes and batch workloads where density or cost per GB is key
A Big Data Journey
ETL Offload
Archival
Deep Learning
Event Processing
In Memory Analytics
HPE Elastic Platform for AnalyticsFlexible Convergence for Big Data Workloads
Low Latency Compute (Event Processing): Moonshot M710P
Big Memory Compute (In-Memory Analytics): Apollo XL170r with 512 GB memory
Archival Storage: Apollo 4200 with 6 TB HDDs
High Latency Compute (ETL Offload and Archival): Apollo XL170 with 256 GB memory
HPC Compute (Deep Learning): Apollo XL190r with GPUs
HDFS Storage: Apollo 4200 with 3 TB HDDs
Unique Value: HPE Workload- and Density-Optimized (WDO) Solution
HPE Elastic Platform for Analytics: innovation delivering unique value to customers and the open source community
Data Consolidation: shared storage pool for multiple Big Data environments
Maximum Elasticity: dynamic cluster provisioning from compute pools without repartitioning data
Flexible Scalability: scale compute and storage independently
Breakthrough Economics: workload-optimized components for better density, cost and power
Ethernet
HPE Apollo 4xx0
HPE Moonshot or HPE Apollo
Key Takeaway: the HP Big Data Reference Architecture is another innovation from HP that leverages the strength of HP's portfolio to deliver value for our customers via a differentiated solution combining HP Moonshot servers and HP Apollo storage servers.
The Value Proposition and the HOW
Traditional scale-up infrastructures separate compute and storage for the flexibility of scaling them independently, but at the cost of management complexity and expense. Scale-out architectures, and new technologies that use DAS storage within a server, lose this ability to scale independently by combining compute and storage in one box, a tradeoff for achieving hyper-scalability and simple management.
The HP Big Data Reference Architecture deploys a standard Hadoop distribution in an asymmetric fashion, running the storage-related components such as the Hadoop Distributed File System (HDFS) and HBase (the open source non-relational distributed database) on Apollo density-optimized servers, and the compute-related components under YARN on Moonshot hyperscale servers. This provides the best of both worlds: the ability to scale compute and storage independently without losing the benefits of scale-out infrastructure.
To make this more flexible, HP worked with Hortonworks to create a new feature in Hadoop called YARN labels, an innovation that we contributed to open source. YARN labels allow us to create pools of compute nodes where applications run, so it is possible to dynamically provision clusters without repartitioning data (since data can be shared across compute nodes). We can scale compute and storage independently by simply adding compute nodes or storage nodes to scale performance linearly. This fundamentally changes the economics of the solution across scale, performance and cost efficiency to meet specific use case and workload needs!
Building blocks for the HPE Elastic Platform for Analytics
HP Apollo 2000 System: a density-optimized compute platform that offers double the density of traditional 1U servers and high memory/core ratios. Workload-optimized compute nodes for Spark, Hive/Tez, MapReduce, YARN, Vertica SQL on Hadoop, and other analytics and batch workloads.
HP Apollo 4200 Scalable System: a cost-effective industry-standard storage server purpose-built for big data with converged infrastructure that offers high-density, energy-efficient storage. Workload-optimized storage nodes for HDFS, Kudu, and building a multi-temperature data lake environment. Ideal for EDW offload workloads, and a foundation storage block for workload consolidation.
HP Moonshot System: a complete server system engineered for specific workloads and delivered in a dense, energy-efficient package. Workload-optimized compute nodes for HBase, Kafka and other low-latency, streaming workloads. Ideal for highest density and lowest power requirements.
HP ProLiant DL300 System: the industry's most popular server balances the latest compute and memory technologies with internal storage and flash options, coupled with industry-leading management and serviceability. Balanced worker node for batch and single-function use cases. Scalable storage node, with a smaller fault domain than the Apollo 4200, for workload consolidation use cases.
Yes, But Does It Perform?
Hyperscale Hadoop (compute nodes + storage nodes, HDFS over Ethernet without RoCE), Moonshot with 45 M710 cartridges: MapReduce read 9.2 GB/sec, write 7.4 GB/sec
Conventional Hadoop (4 worker nodes, 4 storage nodes, HDFS): MapReduce read 4.9 GB/sec, write 3.4 GB/sec
Comparing Configurations: Single Subject Data Mart Use Case (normalized on list price and SpecInt)
Balanced: 18 worker nodes (symmetric), ProLiant DL380 Gen9
BDO: 6 worker node blocks (18 nodes), Apollo 4530; same performance (SpecInt), 7% higher $/SpecInt
WDO: 16 compute nodes + 4 storage nodes, Apollo 2000/4200; 6% better performance (SpecInt), 2% lower $/SpecInt
Hyperscale Price/Performance Compared with Conventional Cluster
Normalized on list price and storage capacity: ProLiant DL380 vs. Apollo 2000 + Apollo 4200
Independent scaling of compute and storage. Hyperscale vs. conventional scale-out:
HOT data*: 2.8x compute, 97% of the storage capacity, 4x the memory
BALANCED*: 1.6x compute, 1.5x the storage capacity, 2.5x the memory
COLD data*: 0.9x of the compute, 2.1x the storage capacity, 1.5x the memory
* Compared with balanced, conventional full rack cluster
Hyperscale benefits for Big Data
Hadoop Labels feature (JIRA YARN-796)
Contributed node label concepts into Hadoop (Apache 2.6 trunk)
Allows scheduling of YARN containers to specific pools of nodes
Combined with the Hyperscale approach, compute nodes can be dynamically assigned because no data needs to be repartitioned
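The mechanics of label-based placement can be sketched in a few lines of plain Python. This is an illustrative simulation, not the real YARN scheduler, and all names are invented; on a live cluster, labels are managed with the `yarn rmadmin -addToClusterNodeLabels` and `-replaceLabelsOnNode` commands.

```python
# Illustrative sketch of YARN node-label scheduling (not the real YARN API):
# each node carries a label, and a container request is only satisfied by
# nodes whose label matches the requested one.

def schedule(nodes, requests):
    """Assign each container request to a free node with a matching label."""
    free = set(nodes)                       # all nodes start out free
    placements = {}
    for container, wanted_label in requests:
        for name in sorted(free):
            if nodes[name] == wanted_label:
                placements[container] = name
                free.discard(name)
                break
    return placements

# A small pool: two "spark" nodes and one "vertica" node.
nodes = {"node1": "spark", "node2": "spark", "node3": "vertica"}
requests = [("etl-1", "spark"), ("sql-1", "vertica"), ("etl-2", "spark")]
print(schedule(nodes, requests))
# {'etl-1': 'node1', 'sql-1': 'node3', 'etl-2': 'node2'}
```

Because the data lives on the shared storage tier, "re-slicing" the cluster is just reassigning labels; no blocks move.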
Diagram: Hadoop Cluster 1 (Hive/Tez, Vertica SQL on Hadoop, Spark) over shared storage nodes; data prep (12am-6am) and predictive analytics (6am-12am) run on a high-power compute pool.
Now, the HP engineering team behind this architecture got inside of Hadoop and started playing around, figuring out how we could really teach it to exploit this architecture. We came up with a concept of creating groups of nodes with similar characteristics (big memory, GPU, slower procs) that we called "herds," and spoke with the distro vendors about the concepts. Hortonworks said, "we're working on a feature called labels which will accomplish this same thing," so we worked with them to contribute our code and specifications into the Hadoop trunk (v2.6) so that everybody has it (in Hortonworks 2.2 now; forthcoming in Cloudera 5.4).
In our architecture, think about how powerful labels are. I can give a group of compute nodes a label, and I can say, this is where I want to run SAS, and I'll run Vertica SQL on Hadoop over here, Spark, HBase, MapReduce, etc. I can slice up the cluster, and, on a moment's notice, I can re-slice it, because I don't have to repartition any data to do that. I can just add new nodes, or reshape my current node pool, to meet whatever workload demands I have. Many organizations will, during the evening hours, prep, aggregate and roll up their data for use the coming day; with this approach, I can allocate all of my nodes to these types of batch jobs during the maintenance period, and then reallocate to analytic, ingest and operational tasks during the business day.
If you just blurred your eyes a little bit and said the storage nodes look like an array, the network was a SAN, the compute nodes were blades and the containers were VMware, this looks an awful lot like a converged system, right? We've just reinterpreted the converged-system model for the new style of IT.
What's cool is that we gave this to everybody. At HP, when you talk about giving things to open source, people get the heebie-jeebies. It's actually good; in our best-case scenario, our competitors will have this kind of architecture. We want everyone to endorse it.
A Modern Sparkitecture for Real-time Analytics (SMACK)
Components: Cassandra, Mongo, HDFS, Kafka, ORC, Oracle, Vertica, Spark (Spark SQL, batch, ML, streaming, graph), Parquet
HANA Vora
Vertica Cluster (MPP Columnar DW): Vertica DW, Vertica DW(s)
Vertica Cluster (Multi-DC DR): Vertica DW, Vertica DW(s)
Kafka (Mirrored)
Dual Ingestion Pipeline
Vertica / Spark
Vertica Kafka Connector
HDFS /AVRO/Topic/YYYY/MM/DD/HH (semi-structured data)
HDFS /ORC/Topic/YYYY/MM/DD (structured data, non-relational); Hadoop Cluster (Hive): Hortonworks, Cloudera, AWS
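The landing-zone convention above (hourly AVRO partitions, daily ORC partitions) can be sketched as a small path builder; `landing_path` is a hypothetical helper for illustration, not part of any shipped tool:

```python
# Sketch of the hour- and day-partitioned HDFS landing paths shown above.
# The layout is the slide's convention; the function name is illustrative.
from datetime import datetime

def landing_path(fmt, topic, ts):
    """Build /<fmt>/<topic>/YYYY/MM/DD[/HH]; AVRO lands hourly, ORC daily."""
    parts = [fmt, topic, f"{ts.year:04d}", f"{ts.month:02d}", f"{ts.day:02d}"]
    if fmt == "AVRO":                       # semi-structured data lands per hour
        parts.append(f"{ts.hour:02d}")
    return "/" + "/".join(parts)

ts = datetime(2016, 6, 9, 14, 30)
print(landing_path("AVRO", "clicks", ts))   # /AVRO/clicks/2016/06/09/14
print(landing_path("ORC", "clicks", ts))    # /ORC/clicks/2016/06/09
```

Partitioning by ingestion time this way lets downstream Hive tables register each hour or day as a partition without rescanning older data.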
Spark Parallel Streaming Transformations: near real-time transform/reshape mapping to the Vertica SOT (Scala, Java, Python, SQL), running on the Hadoop Spark cluster
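As a rough illustration of the transform/reshape step, here is a plain-Python stand-in for what the streaming job does per micro-batch; the real job would use Spark's APIs, and the target schema below is invented for the example:

```python
# Plain-Python stand-in for the per-batch transform/reshape step the Spark
# streaming job performs: map raw JSON messages onto the target (hypothetical)
# source-of-truth column order, dropping any extra fields.
import json

TARGET_COLUMNS = ("event_time", "user_id", "action")   # invented SOT schema

def reshape(batch):
    """Return one tuple per message, ordered to match TARGET_COLUMNS."""
    rows = []
    for raw in batch:
        msg = json.loads(raw)
        rows.append(tuple(msg.get(col) for col in TARGET_COLUMNS))
    return rows

batch = ['{"user_id": 7, "action": "click", '
         '"event_time": "2016-06-09T14:30:00", "junk": 1}']
print(reshape(batch))   # [('2016-06-09T14:30:00', 7, 'click')]
```

Keeping the reshape logic pure per-message like this is what lets Spark parallelize it freely across partitions.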
SparkVerticaLoader
Web / Mobile / IoT
MySQL / Applications
Centralized Data Hub: Kafka Cluster
Data Center Boundary; Parallel Streaming Transformation Loader
Confluent REST Proxy (JSON/AVRO messages)
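For the REST Proxy ingestion path, producers wrap each message in the proxy's JSON envelope and POST it to a topic endpoint. A minimal sketch, assuming the v1 wire format and a placeholder proxy host (`rest_proxy_request` is a hypothetical helper; it only builds the request, it does not send it):

```python
# Sketch of a JSON message envelope for the Confluent REST Proxy
# (v1 wire format; host and topic names are placeholders).
import json

def rest_proxy_request(topic, values, host="http://rest-proxy:8082"):
    """Return (url, headers, body) for POSTing records to a Kafka topic."""
    url = f"{host}/topics/{topic}"
    headers = {"Content-Type": "application/vnd.kafka.json.v1+json"}
    body = json.dumps({"records": [{"value": v} for v in values]})
    return url, headers, body

url, headers, body = rest_proxy_request("web_events", [{"page": "/home", "ms": 42}])
print(url)    # http://rest-proxy:8082/topics/web_events
print(body)   # {"records": [{"value": {"page": "/home", "ms": 42}}]}
```

An actual producer would hand `url`, `headers`, and `body` to any HTTP client; the proxy replies with the partition and offset assigned to each record.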
Kafka Connect
This is a holistic system: ingestion must take the downstream loaders into consideration, the loader must scale, and clusters must scale with a separation of concerns when needed, based on workload and use cases (CPU, memory, and storage capacity).
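On the Vertica side, the Kafka connector shown in the pipeline is typically driven by a COPY statement. A sketch along the lines of Vertica's Kafka integration, with placeholder broker, topic, and table names; check your Vertica version's documentation for the exact parameters:

```sql
-- Stream a Kafka topic into a Vertica table (all names are placeholders).
COPY web_events
SOURCE KafkaSource(stream='web_events|0|-2',          -- topic|partition|start offset
                   brokers='kafka01.example.com:9092',
                   stop_on_eof=TRUE)
PARSER KafkaJSONParser();
```

Running one such COPY per partition is what gives the "parallel streaming loader" its parallelism on the warehouse side.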
Infrastructure trends affecting Big Data architecture
Workload optimization: low-power SoCs and other accelerators giving rise to workload-optimized servers
Faster network fabric: dramatic increase in fabric speeds
Multi-temperature storage: enterprise adoption of tiered, accelerated (NVMe, flash, etc.) storage
Container-based apps: running multiple container apps while hosting a common resource management layer (YARN)
Application Optimized: Engineered, tested and integrated for workload-specific performance. Choose cartridges for web serving, hosted desktops, video transcoding, application delivery, real-time data processing, and more.
Highly flexible fabrics: Highly flexible chassis fabric with significant bandwidth to move multiple terabits of data per second. Low-latency throughput for fast intra-cartridge communications and external network connectivity to each server.
System-on-a-chip: Energy-efficient system-on-a-chip (SoC) design and shared infrastructure across 45 hot-pluggable server cartridges in a 4.3U chassis. Enables low power consumption, ultra-high density and flexible scale-out with significantly less cabling.
Dense form factor: Faster innovation and unprecedented scale begins with the Moonshot Chassis. This versatile chassis simplifies management, with four iLO modules that share management responsibility for the 45 servers, power, cooling, network uplinks and switches. Its 4.3U form factor allows for 10 chassis per rack. With a quad-core cartridge, that amounts to 1,800 servers in a single rack.
Elastic Platform for Analytics: long-term view. Evolve to support multiple compute and storage blocks.
Low Cost Nodes
SSD Nodes
Disk Nodes, Archive Nodes
Multi-temperature, density-optimized storage using HDFS tiering, NoSQL stores and object stores
GPU Nodes
FPGA Nodes
Big Memory Nodes
Density and Workload Optimized compute nodes to accelerate various big data software
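The HDFS tiering behind the multi-temperature storage blocks is driven by per-path storage policies. A sketch using the stock `hdfs storagepolicies` CLI; the paths and tier choices are illustrative, and the commands require a live cluster whose DataNodes declare SSD/ARCHIVE media types:

```shell
# Tag directories so new blocks land on the matching media tier
# (requires DataNode storage types to be configured; paths are examples).
hdfs storagepolicies -setStoragePolicy -path /data/hot  -policy ALL_SSD
hdfs storagepolicies -setStoragePolicy -path /data/cold -policy COLD

# Verify the policy on a path
hdfs storagepolicies -getStoragePolicy -path /data/cold
```

With policies in place, hot, balanced and cold datasets can share one namespace while physically living on the node types that suit them.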
Delivering a Scalable Data Lake for the Enterprise
Unlock the most value and performance from Hadoop
Scale without compromising data security, reliability, and ROI
Enterprise-grade, trusted, and proven HPE solution
Optimize the Hadoop Data Lake for More Business Value
Elastic Platform for Analytics (Workload- and Density-Optimized)
High-Performing Analytics Engines for Hadoop
Consulting & Implementation Services for Hadoop
Data Security for Hadoop
HPE recognized the challenges and limitations of Hadoop and has developed a solution that addresses them through a robust yet flexible offering. It enables organizations to maximize the performance, infrastructure, and analytics capabilities of Hadoop, place more trust in the data stored there, and implement a future-proof, data-centric foundation that scales with evolving business needs.
What Matters?
Infrastructure. Needs to be flexible, optimized and future-proof to support the increasingly complex workloads running on Hadoop.
Analytics. Robust analytics engines that deliver performance and stability for business-grade SLAs.
Security. Need data encryption and tokenization for sensitive and regulated data, protecting the data at rest, in motion and in use.
Open. Need a partner solution that supports all the major Hadoop distributions (Hortonworks, Cloudera, MapR) using open architectures, not proprietary offerings.
HPE's comprehensive solution includes flexible, optimized infrastructure (for both standard symmetric and asymmetric architectures), high-performing analytics engines (a best-in-class SQL on Hadoop engine plus an unstructured human data platform), data encryption security for Hadoop, and consulting and implementation services.
Thank you