Deploying Big-Data-as-a-Service (BDaaS) in the Enterprise

BDaaS Meetup, October 6, 2016

Tom Phelan, Co-founder and Chief Architect


Outline

• Big Data: Conflicting Enterprise Needs
• Big-Data-as-a-Service (BDaaS)
• BDaaS Enterprise Requirements
• Design Decisions
• Implementation
• Demo

Conflicting Enterprise Needs

Data scientists want flexibility:
• Different versions (and new releases) of Hadoop, Spark, et al.
• Different sets of BI / analytics tools

IT wants control:
• Multi-tenancy
  – QoS, Data Access
• Security
  – Network, Authorization/Authentication

Big Data New Realities

Big Data Traditional Assumptions
• Bare-metal
• Data locality
• HDFS on local disks

Big Data New Realities
• Containers and VMs
• Compute and storage separation
• In-place access on remote data stores

New Benefits and Value
• Big-Data-as-a-Service
• Agility and cost savings
• Faster time-to-insights

Big-Data-as-a-Service Defined

“BDaaS basically provides a cloud based framework that offers end-to-end big data solutions to business organizations.”

On-Demand, Self-Service, Elastic
Big Data Infrastructure, Applications, Analytics

Source: http://www.marketsandmarkets.com/Market-Reports/big-data-as-a-service-market-4129107.html


Four Types of BDaaS

• Core BDaaS
• Performance BDaaS
• Feature BDaaS
• Integrated BDaaS

Source: http://www.semantikoz.com/blog/big-data-as-a-service-definition-classification

Core BDaaS
• Minimal platform, such as Hadoop with YARN

Performance BDaaS
• “Downwards” vertical integration
• Includes optimized infrastructure
• Tight integration with Core BDaaS

Feature BDaaS
• “Upwards” vertical integration
• Includes features beyond Hadoop
• Support for multiple Core BDaaS providers & BI tools

Integrated BDaaS
• Full vertical integration and optimization
• Includes both Performance BDaaS & Feature BDaaS

BDaaS – Public Cloud or On-Prem?


BDaaS On-Prem – Architectures

• Deployment Mechanisms
  – Bare Metal
  – Virtualization
      – Virtual Machines
      – Containers

[Figure: Virtual Machines vs. Containers. Source: https://www.docker.com/what-docker]

Virtualization Tradeoffs

• Tradeoffs depend on virtualization technology
• Hypervisor (Virtual Machines)
  – Performance: CPU tax
  – Security: Strong isolation and fault containment
  – Examples: VMware BDE, OpenStack Sahara
• Linux Containers
  – Performance: No CPU tax
  – Security: Isolation and fault containment still developing
  – Examples: BlueData EPIC, Mesos + Myriad

Containers = the Future of Big Data

BDaaS ENTERPRISE REQUIREMENTS

BDaaS – Enterprise Requirements

• Multi-tenancy
  – Resource Allocation/Isolation
  – No Noisy Neighbor
  – Security
      – Network
      – Storage
  – User authorization/authentication
• Support for your application
  – Quickly add support for new apps, frameworks, & versions
  – Cluster expansion and contraction
  – Support for HA configurations


• Infrastructure & operational requirements
  – Support for capacity expansion
  – Support for software upgrade
  – Integration with existing container orchestration
      – Kubernetes, Mesos, Docker Swarm
  – Integration with existing network configuration and policies
      – IP allocation and use, routing, security, SDN (e.g. Cisco ACI, VMware NSX)
  – Integration with user authentication systems
      – LDAP/AD


• Infrastructure & operational requirements (cont’d)
  – Integration with existing policies
      – Supported versions of OS, containers, KVM, VMware, etc.
  – Monitoring
  – Limitations on root access
  – High Availability
      – Geographic replication


DESIGN DECISIONS

BDaaS: Design Decisions I

• Run Hadoop/Spark distros and applications unmodified
  – Deploy all services that run on a single bare-metal (BM) host in a single container
• Multi-tenancy support is key
  – Network and storage security
• Clusters of containers span physical hosts

BDaaS: Design Decisions II

• Images built to “auto-configure” themselves at time of instantiation
  – Not all instances of a single image run the same set of services when instantiated
      – Master vs. worker cluster nodes
  – Support “reboot” of cluster
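The "auto-configure" idea above can be illustrated with a small Python sketch: the same image starts different services depending on a role supplied when the container is instantiated. This is only an illustration under stated assumptions, not BlueData's implementation. The environment variable name NODE_ROLE, the service lists, and both function names are hypothetical.

```python
import os

# Hypothetical mapping from node role to services. A real image would
# derive this from the Hadoop/Spark distribution it packages.
SERVICES_BY_ROLE = {
    "master": ["namenode", "resourcemanager", "spark-history-server"],
    "worker": ["datanode", "nodemanager"],
}


def services_for_role(role):
    """Return the services an instance should start for the given role."""
    try:
        return list(SERVICES_BY_ROLE[role])
    except KeyError:
        raise ValueError(f"unknown node role: {role!r}") from None


def configure_from_environment(env=None):
    """Choose services from a role injected at container start.

    NODE_ROLE is an assumed variable name; it defaults to "worker".
    """
    env = os.environ if env is None else env
    return services_for_role(env.get("NODE_ROLE", "worker"))


if __name__ == "__main__":
    print(configure_from_environment({"NODE_ROLE": "master"}))
```

Because the selection depends only on the injected role, running it again after a cluster "reboot" yields the same set of services for each node.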

BDaaS: Design Decisions III

• Maintain the promise of containers
  – Keep them as stateless as possible
  – Container storage is always ephemeral
  – Persistent storage is external to the container

IMPLEMENTATION

Multi-Tenant Deployment

[Figure: Team 1 (ETL using Hadoop), Team 2 (ETL using Spark), and Team 3 (Machine Learning) run in separate clusters with compute isolation between teams; the clusters carry version labels 5.5, 5.4, 1.5, 2.4, and 1.6.]

Multiple teams or business groups

Evaluate different Big Data analytics use cases (e.g. ETL, M/L)

Use different services & tools (e.g. Hive, Notebooks, SparkR)

Use different distributions of Hadoop and/or Spark

BlueData EPIC software platform

Shared server infrastructure

Shared data sets

Multiple distributions, services, tools on shared, cost-effective infrastructure

Shared Data (HDFS)

Shared, Centrally Managed Server Infrastructure

How We Did It: Implementation I

Resource Utilization
• CPU cores vs. CPU shares
• Over-provisioning of CPU recommended
• No over-provisioning of memory
  – Swap

Network
• Connect containers across hosts
• Persistence of IP address across container restart
• DHCP/DNS service required for IP allocation and hostname resolution
• Deploy VLANs and VXLAN tunnels for tenant-level traffic isolation

Noisy neighbors
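The Resource Utilization bullets above amount to a placement rule: CPU may be over-provisioned, memory may not. A minimal Python sketch of such a rule follows. The Host class, the 2.0 CPU overcommit ratio, and the vCPU accounting are assumptions made for illustration; the slides do not state which ratio or mechanism BlueData uses.

```python
from dataclasses import dataclass


@dataclass
class Host:
    cores: int
    memory_gb: int
    cpu_overcommit: float = 2.0  # assumed ratio, not taken from the slides
    used_vcpus: int = 0
    used_memory_gb: int = 0

    def can_place(self, vcpus, memory_gb):
        # CPU: virtual CPUs may exceed physical cores up to the ratio.
        cpu_ok = self.used_vcpus + vcpus <= self.cores * self.cpu_overcommit
        # Memory: never promise more than the host physically has.
        mem_ok = self.used_memory_gb + memory_gb <= self.memory_gb
        return cpu_ok and mem_ok

    def place(self, vcpus, memory_gb):
        """Reserve resources for a container; return False if it does not fit."""
        if not self.can_place(vcpus, memory_gb):
            return False
        self.used_vcpus += vcpus
        self.used_memory_gb += memory_gb
        return True
```

With a 36-core, 112 GB host, a request for 40 vCPUs and 64 GB is accepted because CPU is allowed to overcommit, while a second request for another 64 GB is refused because it would exceed physical memory.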

Network Architecture

[Figure: One controller host and three worker hosts running BlueData EPIC, with host addresses IP1 to IP4 connected through an external switch/gateway to the external network. Containers belonging to Tenant 1, Tenant 2, and Tenant 3 carry addresses BD IP1 to BD IP11 and connect through an internal gateway.]

• Cluster Provisioning and Automation (embedded containers for Hadoop/Spark/BI tool nodes)
• Internal Networking (BlueData-assigned IPs from floating IP range)
• Policy Engine (resource / placement)
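Assigning container addresses from a floating range while keeping a container's address stable across restarts can be sketched in Python with the standard ipaddress module. This is an illustrative sketch only: the FloatingIpPool class, the 10.1.0.0/29 range, and the container names are hypothetical, and a real deployment would also persist the mapping and publish it through DHCP/DNS.

```python
import ipaddress


class FloatingIpPool:
    """Hand out addresses from a range and remember them per container."""

    def __init__(self, cidr):
        # hosts() skips the network and broadcast addresses of the range.
        self._free = list(ipaddress.ip_network(cidr).hosts())
        self._assigned = {}

    def address_for(self, container_name):
        # A restarted container asks again under the same name and
        # receives the address it held before.
        if container_name not in self._assigned:
            if not self._free:
                raise RuntimeError("floating IP range exhausted")
            self._assigned[container_name] = self._free.pop(0)
        return str(self._assigned[container_name])

    def release(self, container_name):
        # Return the address to the pool when the container is deleted.
        addr = self._assigned.pop(container_name, None)
        if addr is not None:
            self._free.append(addr)


if __name__ == "__main__":
    pool = FloatingIpPool("10.1.0.0/29")
    print(pool.address_for("spark-master"))
    print(pool.address_for("spark-master"))  # same address after a restart
```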

How We Did It: Implementation II

Storage
• Expandable, unified / and /data storage
  – By default, Docker provides 10 GB (fixed) plus optional /data
• DataTap (version-independent, HDFS-compliant)
  – Connectivity to external storage

Image Management
• Utilize Docker’s image repository
• Author new Docker images using Dockerfiles
  – Inject parameters at runtime

TIP: Mounting block devices into a container does not support symbolic links (IOW: /dev/sdb will not work, /dm/… PCI device can change across host reboot).

TIP: Docker images can get large. Use “docker squash” to save on size.

How We Did It: Security Considerations

• Security is essential since containers and host share one kernel
  – Non-privileged containers
      – Achieved through layered set of capabilities
      – Different capabilities provide different levels of isolation and protection
      – Add “capabilities” to a container based on what operations are permitted
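One way to picture "add capabilities based on what operations are permitted" is to start a container with every Linux capability dropped and add back only an explicit allow-list. The Python sketch below builds such a command line using Docker's `--cap-drop` and `--cap-add` options. The function name, the image tag, and the chosen capabilities are hypothetical examples; the slides do not say which capabilities BlueData grants.

```python
def docker_run_args(image, allowed_capabilities=()):
    """Build a `docker run` argument list for a non-privileged container.

    Starts from no capabilities (--cap-drop ALL) and adds back only the
    ones named in allowed_capabilities.
    """
    args = ["docker", "run", "--detach", "--cap-drop", "ALL"]
    # Sort and de-duplicate so the command line is stable and reviewable.
    for cap in sorted(set(allowed_capabilities)):
        args += ["--cap-add", cap]
    args.append(image)
    return args


if __name__ == "__main__":
    # Hypothetical image tag and capability allow-list.
    print(" ".join(docker_run_args("spark:1.5.2", ["NET_BIND_SERVICE", "CHOWN"])))
```

Keeping the allow-list explicit makes each tenant's permitted operations easy to audit, at the cost of having to discover which capabilities each service really needs.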

How We Did It: Sample Dockerfile

# Spark-1.5.2 docker image for RHEL/CentOS 6.x
FROM centos:centos6

# Download and extract spark
RUN mkdir -p /usr/lib/spark && curl -s http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.4.tgz | tar -xz -C /usr/lib/spark/

# Download and extract scala
RUN mkdir -p /usr/lib/scala && curl -s http://www.scala-lang.org/files/archive/scala-2.10.3.tgz | tar -xz -C /usr/lib/scala/

# Install zeppelin
RUN mkdir -p /usr/lib/zeppelin && curl -s http://10.10.10.10:8080/build/thirdparty/zeppelin/zeppelin-0.6.0-incubating-SNAPSHOT-v2.tar.gz | tar -xz -C /usr/lib/zeppelin

# Remove package caches and temporary files to keep the image small
RUN yum clean all && rm -rf /tmp/* /var/tmp/* /var/cache/yum/*

# Add the service configuration script, make it executable (+x, not -x), and run it
ADD configure_spark_services.sh /root/configure_spark_services.sh
RUN chmod +x /root/configure_spark_services.sh && /root/configure_spark_services.sh

A Word About Performance …

Performance Testing: Spark
• Spark 1.x on YARN
• HiBench - Terasort
  – Data sizes: 100GB, 500GB, 1TB
• 10 node physical/virtual cluster
• 36 cores and 112GB memory per node
• 2TB HDFS storage per node (SSDs)
• 800GB ephemeral storage

Spark on Docker: Performance

[Chart: throughput in MB/s]

NEW – BDaaS On-Prem and Cloud

BlueData on AWS public cloud
• Extending the user experience and value of BlueData to public cloud
• Single pane of glass for on-prem and off-prem Big Data workloads
• Initial AWS support; then MS Azure, Google Cloud Platform, others
• Ask us about our directed availability program for AWS

Q&A

www.bluedata.com
