Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
Transcript of Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise
BDaaS Meetup, October 6, 2016
Tom Phelan, Co-founder and Chief Architect [email protected]
DEPLOYING BIG-DATA-AS-A-SERVICE IN THE ENTERPRISE
Outline
• Big Data: Conflicting Enterprise Needs
• Big-Data-as-a-Service (BDaaS)
• BDaaS Enterprise Requirements
• Design Decisions
• Implementation
• Demo
Conflicting Enterprise Needs
Data scientists want flexibility:
• Different versions (and new releases) of Hadoop, Spark, et al.
• Different sets of BI / analytics tools
IT wants control:
• Multi-tenancy
  – QoS, Data Access
• Security
  – Network, Authorization/Authentication
Big Data New Realities
Big Data Traditional Assumptions
Bare-metal
Data locality
HDFS on local disks
Big Data New Realities
Containers and VMs
Compute and storage separation
In-place access on remote data stores
New Benefits and Value
Big-Data-as-a-Service
Agility and cost savings
Faster time-to-insights
Big-Data-as-a-Service Defined
“BDaaS basically provides a cloud based framework that offers end-to-end big data solutions to business organizations.”
On-Demand, Self-Service, Elastic
Big Data Infrastructure, Applications, Analytics
Source: http://www.marketsandmarkets.com/Market-Reports/big-data-as-a-service-market-4129107.html
Four Types of BDaaS
• Core BDaaS
• Performance BDaaS
• Feature BDaaS
• Integrated BDaaS
Source: www.semantikoz.com/blog/big-data-as-a-service-definition-classification
Core BDaaS
• Minimal platform, such as Hadoop with YARN
Performance BDaaS
• “Downwards” vertical integration
• Includes optimized infrastructure
• Tight integration with Core BDaaS
Feature BDaaS
• “Upwards” vertical integration
• Includes features beyond Hadoop
• Support for multiple Core BDaaS providers & BI tools
Integrated BDaaS
• Full vertical integration and optimization
• Includes both Performance BDaaS & Feature BDaaS
BDaaS – Public Cloud or On-Prem?
BDaaS On-Prem – Architectures
• Deployment Mechanisms
  – Bare Metal
  – Virtualization
    • Virtual Machines
    • Containers
[Figure: Virtual Machines vs. Containers]
Source: www.docker.com/what-docker
Virtualization Tradeoffs
• Tradeoffs depend on virtualization technology
• Hypervisor (Virtual Machines)
  – Performance: CPU tax
  – Security: Strong isolation and fault containment
  – Examples: VMware BDE, OpenStack Sahara
• Linux Containers
  – Performance: No CPU tax
  – Security: Isolation and fault containment still developing
  – Examples: BlueData EPIC, Mesos + Myriad
Containers = the Future of Big Data
BDaaS ENTERPRISE REQUIREMENTS
BDaaS – Enterprise Requirements
• Multi-tenancy
  – Resource Allocation/Isolation
  – No Noisy Neighbor
  – Security
    • Network
    • Storage
    • User authorization/authentication
• Support for your application
  – Quickly add support for new apps, frameworks, & versions
  – Cluster expansion and contraction
  – Support for HA configurations
BDaaS – Enterprise Requirements
• Infrastructure & operational requirements
  – Support for capacity expansion
  – Support for software upgrade
  – Integration with existing container orchestration
    • Kubernetes, Mesos, Docker Swarm
  – Integration with existing network configuration and policies
    • IP allocation and use, routing, security, SDN (e.g. Cisco ACI, VMware NSX)
  – Integration with user authentication systems
    • LDAP/AD
BDaaS – Enterprise Requirements
• Infrastructure & operational requirements (cont’d)
  – Integration with existing policies
    • Supported versions of OS, containers, KVM, VMware, etc.
  – Monitoring
  – Limitations on root access
  – High Availability
    • Geographic replication
DESIGN DECISIONS
BDaaS: Design Decisions I
• Run Hadoop/Spark distros and applications unmodified
  – Deploy all services that run on a single bare-metal (BM) host in a single container
• Multi-tenancy support is key
  – Network and storage security
• Clusters of containers span physical hosts
BDaaS: Design Decisions II
• Images built to “auto-configure” themselves at time of instantiation
  – Not all instances of a single image run the same set of services when instantiated
    • Master vs. worker cluster nodes
  – Support “reboot” of cluster
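To make the "auto-configure" idea concrete, here is a minimal sketch of how one image could choose its services at instantiation. It is an illustration under stated assumptions, not the mechanism used in BlueData EPIC: the `NODE_ROLE` variable, the service names, and the `service ... start` command format are all hypothetical.

```python
import os

# Hypothetical mapping from node role to the services that role starts.
# The service names are illustrative, not taken from the talk.
SERVICES_BY_ROLE = {
    "master": ["namenode", "resourcemanager", "spark-history-server"],
    "worker": ["datanode", "nodemanager"],
}


def services_for(env):
    """Return the services this container instance should start.

    `env` is a mapping such as os.environ; NODE_ROLE is an assumed
    variable name that the provisioning layer would set per container.
    """
    role = env.get("NODE_ROLE", "worker")
    if role not in SERVICES_BY_ROLE:
        raise ValueError(f"unknown NODE_ROLE: {role!r}")
    return list(SERVICES_BY_ROLE[role])


def configure(env):
    """Build the start commands for this instance.

    The result depends only on `env`, so running it again after a
    cluster "reboot" yields the same plan.
    """
    return [f"service {name} start" for name in services_for(env)]


if __name__ == "__main__":
    for command in configure(os.environ):
        print(command)
```

Because the plan is derived from the environment rather than stored in the container, the same image can serve as master or worker, and a restarted container can reconfigure itself without saved state.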
BDaaS: Design Decisions III
• Maintain the promise of containers
  – Keep them as stateless as possible
  – Container storage is always ephemeral
  – Persistent storage is external to the container
IMPLEMENTATION
Multi-Tenant Deployment
[Diagram: Team 1 (ETL using Hadoop), Team 2 (ETL using Spark), and Team 3 (Machine Learning) run in separate environments with compute isolation between them; version labels shown: 5.5, 5.4, 1.5, 2.4, 1.6]
Multiple teams or business groups
Evaluate different Big Data analytics use cases (e.g. ETL, M/L)
Use different services & tools (e.g. Hive, Notebooks, SparkR)
Use different distributions of Hadoop and/or Spark
BlueData EPIC software platform
Shared server infrastructure
Shared data sets
Multiple distributions, services, tools on shared, cost-effective infrastructure
Shared Data (HDFS)
Shared, Centrally Managed Server Infrastructure
How We Did It: Implementation I
Resource Utilization
• CPU cores vs. CPU shares
• Over-provisioning of CPU recommended
• No over-provisioning of memory
  – Swap
• Noisy neighbors
Network
• Connect containers across hosts
• Persistence of IP address across container restart
• DHCP/DNS service required for IP allocation and hostname resolution
• Deploy VLANs and VXLAN tunnels for tenant-level traffic isolation
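The difference between CPU shares and memory can be sketched in a few lines. CPU shares are relative weights (Docker's `--cpu-shares` defaults to 1024), so CPU can be over-provisioned and is divided proportionally only under contention, whereas memory requests must fit within the host. The function names and the example numbers below are hypothetical and only illustrate the arithmetic.

```python
def cpu_under_contention(shares, host_cores):
    """Split host cores in proportion to each container's CPU shares.

    This models only the fully contended case; when the host is idle,
    a container may use more than its proportional slice.
    """
    total = sum(shares.values())
    return {name: host_cores * s / total for name, s in shares.items()}


def memory_fits(requests_gb, host_memory_gb):
    """Admit containers only if their memory requests fit without overcommit."""
    return sum(requests_gb.values()) <= host_memory_gb


if __name__ == "__main__":
    # Hypothetical containers on a 36-core, 112 GB host.
    print(cpu_under_contention({"etl": 2048, "ml": 1024, "adhoc": 1024}, 36))
    print(memory_fits({"etl": 64, "ml": 32, "adhoc": 32}, 112))
```

With these weights the `etl` container is entitled to half of the cores when every container is busy, while the three memory requests (128 GB in total) would be rejected on a 112 GB host rather than pushed into swap.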
Network Architecture
[Diagram labels: Controller Host; three Worker Hosts; External Network with host addresses IP1–IP4; External Switch/Gateway; Internal Gateway; Tenant 1, Tenant 2, Tenant 3; container addresses BD IP1–BD IP11]
BlueData EPIC
• Cluster Provisioning and Automation (embedded containers for Hadoop/Spark/BI tool nodes)
• Internal Networking (BlueData-assigned IPs from floating IP range)
• Policy Engine (resource / placement)
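One way to picture "persistence of IP address across container restart" with addresses assigned from a floating range is a per-tenant pool that keys leases by container name. This is a sketch under assumptions, not BlueData's implementation: the `TenantIpPool` class, the container names, and the `10.32.1.0/29` range are hypothetical.

```python
import ipaddress


class TenantIpPool:
    """Hand out addresses from one tenant's range and remember them by container name."""

    def __init__(self, cidr):
        self._free = list(ipaddress.ip_network(cidr).hosts())
        self._leases = {}

    def acquire(self, container_name):
        # A restarted container asks again under the same name and
        # receives the address it held before.
        if container_name not in self._leases:
            if not self._free:
                raise RuntimeError("address range exhausted")
            self._leases[container_name] = self._free.pop(0)
        return str(self._leases[container_name])

    def release(self, container_name):
        # Called only when the container is deleted, not when it restarts.
        address = self._leases.pop(container_name)
        self._free.insert(0, address)


if __name__ == "__main__":
    pool = TenantIpPool("10.32.1.0/29")  # hypothetical tenant range
    print(pool.acquire("spark-master"))
    print(pool.acquire("spark-worker-1"))
    print(pool.acquire("spark-master"))  # same address as the first call
```

Keeping one pool per tenant also lines up with tenant-level isolation: each tenant's range can be mapped to its own VLAN or VXLAN segment.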
How We Did It: Implementation II
Storage
• Expandable, unified / and /data storage
  – By default, Docker provides 10 GB (fixed) plus optional /data
• DataTap (version-independent, HDFS-compliant)
  – Connectivity to external storage
Image Management
• Utilize Docker’s image repository
• Author new Docker images using Dockerfiles
  – Inject parameters at runtime
TIP: Mounting block devices into a container does not support symbolic links (IOW: /dev/sdb will not work, /dm/… PCI device can change across host reboot).
TIP: Docker images can get large. Use “docker squash” to save on size.
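"Inject parameters at runtime" can be illustrated by rendering a configuration file from a template when the container starts, so the image itself stays generic. The sketch below assumes a template baked into the image and parameters supplied at start-up; the template text, the function name, and the parameter values are hypothetical, while `spark.master` and `spark.executor.memory` are standard Spark property names.

```python
from string import Template

# Hypothetical template for a Spark defaults file baked into the image.
SPARK_DEFAULTS_TEMPLATE = Template(
    "spark.master            $spark_master\n"
    "spark.executor.memory   $executor_memory\n"
)


def render_spark_defaults(params):
    """Fill the template; substitute() raises KeyError if a parameter is missing."""
    return SPARK_DEFAULTS_TEMPLATE.substitute(params)


if __name__ == "__main__":
    # Values that the provisioning layer would pass in at container start.
    print(render_spark_defaults(
        {"spark_master": "yarn-client", "executor_memory": "4g"}
    ))
```

Failing loudly on a missing parameter is a deliberate choice here: a container that starts with a half-filled configuration is harder to diagnose than one that refuses to start.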
How We Did It: Security Considerations
• Security is essential since containers and host share one kernel
  – Non-privileged containers
• Achieved through layered set of capabilities
• Different capabilities provide different levels of isolation and protection
• Add “capabilities” to a container based on what operations are permitted
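The capability-based approach above can be sketched as: drop every Linux capability, then add back only those that the permitted operations require. The example builds a `docker run` argument list using Docker's `--cap-drop` and `--cap-add` flags. The operation names and their mapping to capabilities are hypothetical and would need to be worked out per application.

```python
# Hypothetical mapping from a permitted operation to the Linux capability it needs.
CAPABILITY_FOR_OPERATION = {
    "bind_low_ports": "NET_BIND_SERVICE",
    "change_file_ownership": "CHOWN",
    "configure_network": "NET_ADMIN",
}


def docker_run_args(image, permitted_operations):
    """Build a `docker run` argument list for a non-privileged container.

    All capabilities are dropped first; only those needed by the
    permitted operations are added back.
    """
    args = ["docker", "run", "--detach", "--cap-drop=ALL"]
    for operation in sorted(permitted_operations):
        args.append(f"--cap-add={CAPABILITY_FOR_OPERATION[operation]}")
    args.append(image)
    return args


if __name__ == "__main__":
    print(" ".join(docker_run_args("centos:centos6", {"bind_low_ports"})))
```

Starting from an empty capability set and adding back keeps the container non-privileged by default; an operation that is not in the mapping fails with a `KeyError` instead of silently running with extra rights.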
How We Did It: Sample Dockerfile
# Spark-1.5.2 docker image for RHEL/CentOS 6.x
FROM centos:centos6

# Download and extract spark
RUN mkdir /usr/lib/spark; curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/

# Download and extract scala
RUN mkdir /usr/lib/scala; curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar xz -C /usr/lib/scala/

# Install zeppelin
RUN mkdir /usr/lib/zeppelin; curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz | tar xz -C /usr/lib/zeppelin

# Remove package caches and temporary files to keep the image small
RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/*

# Add the service configuration script, make it executable, and run it
ADD configure_spark_services.sh /root/configure_spark_services.sh
RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh
A Word About Performance …
Performance Testing: Spark
• Spark 1.x on YARN
• HiBench - Terasort
  – Data sizes: 100 GB, 500 GB, 1 TB
• 10 node physical/virtual cluster
• 36 cores and 112 GB memory per node
• 2 TB HDFS storage per node (SSDs)
• 800 GB ephemeral storage
Spark on Docker: Performance
[Chart: throughput in MB/s]
DEMO
NEW – BDaaS On-Prem and Cloud
BlueData on AWS public cloud
• Extending the user experience and value of BlueData to public cloud
• Single pane of glass for on-prem and off-prem Big Data workloads
• Initial AWS support; then MS Azure, Google Cloud Platform, others
• Ask us about our directed availability program for AWS
Q&A
www.bluedata.com