Post on 15-Jul-2015
ROME 27-28 march 2015 - Speaker’s name
Dive into Sahara
Davide Del Vecchio Francesco Vollero Matteo Bernacchi
March 27, 2015
Davide Del Vecchio
•Principal Domain Architect Middleware
•Previous experience with analytics and Big Data
•Background in Science
•Passionate about technology
Who are we
Francesco Vollero
● OpenStack Technical Specialist in EMEA
● Developer background - in OpenStack since Grizzly
● Core contributor in packstack, openstack-puppet
● Snooping other OpenStack components like Sahara
● Functional programming brain oriented :)
Matteo Bernacchi
•Senior Infrastructure Consultant
•Experienced in cloud solutions deployment
•Supporter of FOSS technologies since 2003
•An introduction to Big Data
•An overview of the OpenStack components
•A (Moderately) Brief Introduction to Sahara
•Sahara in action
Agenda
Everything You Ever Wanted to Know About Big Data But Only Had About 20 Minutes to Learn
Insert some very Big Data here …
What is it
•Something you cannot drag and drop
•Something you cannot hope to process in a reasonable amount of time on your own machines
•Something that needs purpose-built algorithms to work with
It is not just a matter of volume ...
There are many other key aspects
•Data must be processed in a short time frame
•Data sets differ from traditional relational and non-relational ones, and include machine and social data
•The wide availability of open source computational and mathematical tools now reaches beyond academia
•It's the second iteration of the feedback process of open source tools that are now available as a commodity
•Data visualization tools are an accelerator for the movement
How do I commoditize Big Data?
-2004: MapReduce Whitepaper (Google)
- Described the MapReduce algorithm
- Kind of a big deal
-Many were already doing this; it's a very basic prescription
-Specification for easy extensibility
-THIS was the big deal
-Google's vision for clean extension points and design drove the Big Data movement
A Bit of History: MapReduce
-2007: Apache Hadoop
-First and still most significant OSS Big Data engine
-Originally built by Yahoo!
-“Hadoop” now used to refer both to Hadoop itself and the large ecosystem of supporting technologies
-Dominant in the market now, but there are new contenders
-Named after a developer's son's stuffed elephant
A Bit of History: Hadoop
MapReduce: What Does It Do
•MAP
•Iterate over records
•Emit (0, 1, or n) key-value pairs for each
•Word Count:
•Input: “Let's reduce map reduce”
•Output: (“Let's”: 1), (“reduce”: 1), (“map”: 1), (“reduce”: 1)
•REDUCE
•Gather all the KVPs for each key together
•Apply some function to all of each key's values and emit something for each key
•Word Count:
•Input: {“Let's”: [1], “map”: [1], “reduce”: [1, 1]}
•Output: {“Let's”: 1, “map”: 1, “reduce”: 2}
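The word count walk-through can be sketched in a few lines of plain Python. This is only an in-memory illustration of the two phases, not Hadoop code; the function names (map_words, shuffle, reduce_counts) are made up for this example.

```python
from collections import defaultdict


def map_words(record):
    """MAP: emit one (word, 1) pair for every whitespace-separated word."""
    return [(word, 1) for word in record.split()]


def shuffle(pairs):
    """Gather all the values emitted for each key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key, values):
    """REDUCE: apply a function (here, sum) to all of one key's values."""
    return key, sum(values)


pairs = map_words("Let's reduce map reduce")
grouped = shuffle(pairs)
counts = dict(reduce_counts(key, values) for key, values in grouped.items())
print(counts)
```

The intermediate values pairs and grouped correspond to the MAP output and the REDUCE input shown on the slide.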
So... It's... GROUP BY.
•Yes, it is kinda GROUP BY.
•You are now authorized to laugh at Big Data engineers.
•It is, however, VERY easy to parallelize.
•M Mappers can be run against any amount of data on any number of nodes, in small chunks
•N Reducers only have to deal with the data for any one key at a time
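Why this parallelizes so easily can be shown with a small sketch: each mapper only sees its own chunk, so chunks can be processed in any order and on any worker. Threads stand in for cluster nodes here, and the names map_chunk and run_mappers as well as the chunk size are assumptions made for the illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """One mapper: count words in its own chunk, independently of all others."""
    return Counter(word for line in chunk for word in line.split())


def run_mappers(lines, chunk_size=2, workers=3):
    """Split the input into small chunks, map them in parallel, merge per key."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # the merge only ever combines values per key
    return total


lines = ["map reduce", "reduce map reduce", "group by", "map"]
totals = run_mappers(lines)
print(totals)
```

Changing chunk_size or workers changes how the work is spread out, not the result.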
MapReduce Extension Points (Per Hadoop MapReduce Interface)
•An Input Reader
•Divides data into “splits” (1 per mapper)
•Usually 16-128MB
•A Map Function
•A Combiner Function
•Just a reduce function within a mapper process
•With a combiner, mappers only emit one KVP per key
•A Partition Function
•Determines which key goes to which reducer
•Default is hash(key) % len(reducers)
•(Optional) A Compare Function
•Orders final output
•A Reduce Function
•An Output Writer
•By default, writes one file per reducer and just dumps text
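Two of these extension points, the combiner and the partition function, are small enough to sketch in Python. This is an illustration, not the Hadoop interface: combine and partition are made-up names, and CRC32 is used as a stable stand-in hash because Python's built-in hash() of strings changes between runs.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: a reduce step inside the mapper, so each key is emitted once."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Default-style partitioner: hash(key) % len(reducers)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


mapper_output = [("Let's", 1), ("reduce", 1), ("map", 1), ("reduce", 1)]
combined = combine(mapper_output)
assignments = {key: partition(key, 3) for key, _ in combined}
print(combined)
print(assignments)
```

Because the partition depends only on the key, every pair for a given key lands on the same reducer, whichever mapper emitted it.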
MapReduce Abstraction Layers
•Hive (SQL-like)
•DROP TABLE IF EXISTS words;
•CREATE TABLE words( text string ) row format delimited fields terminated by '\n' stored as textfile;
•LOAD DATA LOCAL INPATH 'data_path' OVERWRITE INTO TABLE words;
•SELECT word, COUNT(*) FROM words LATERAL VIEW explode(split(text,' ')) lTable AS word GROUP BY word;
•Pig (relational flow)
•raw_input = LOAD './input.txt';
•words = FOREACH raw_input GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
•grouped = GROUP words BY word;
•counted = FOREACH grouped GENERATE group, COUNT(words);
•STORE counted INTO './wordcount';
Hadoop: HDFS
Hadoop Distributed File System
•Large block size
•128MB default
•Replication
•3 default, 512 max
•Strictly separate from logic – can be used with any algo
•Giraph: Graph Processing
•Mahout: Machine Learning
•The name node tracks data blocks and replication
•Data nodes hold data
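A quick piece of arithmetic shows what the block size and replication defaults mean for a single file. The function name hdfs_footprint and the 1000 MB example file are assumptions for illustration; the sketch relies on the last block occupying only the space it needs.

```python
import math

BLOCK_SIZE_MB = 128  # default block size quoted on the slide
REPLICATION = 3      # default replication factor quoted on the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)
```

For the 1000 MB file this gives 8 blocks, 24 block replicas for the name node to track, and 3000 MB of raw disk across the data nodes.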
Hadoop: Data Processing
•Namenode tasks
•Breaks jobs (whole dataset) into tasks (one mapper or reducer)
•Assigns tasks to data nodes
•Tracks progress to completion
•Retries failed tasks a configurable number of times
•Allows Hadoop clusters to be run on error-prone commodity hardware
•Datanode tasks
•Tracks its own map and reduce jobs
•Transfers data to other nodes as needed
•Each data node has slots for map and reduce tasks (to be run in JVMs)
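The retry behaviour is what makes error-prone hardware tolerable, and it can be sketched as a simple loop. This is not Hadoop's scheduler: run_with_retries, FlakyTask and the limit of 4 attempts are assumptions chosen for the illustration.

```python
def run_with_retries(task, max_attempts=4):
    """Run task(); on failure, retry until max_attempts have been used."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # out of attempts: the whole job fails


class FlakyTask:
    """Hypothetical task that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("node failed")
        return "done"


result, attempts = run_with_retries(FlakyTask(failures=2))
print(result, attempts)
```

A task that keeps failing beyond the configured limit still surfaces as an error, so retries hide transient faults without hiding real bugs.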
Hadoop: The Ecosystem
•Oozie: Workflow manager (chained jobs)
•Data pipelining: Flume, Scribe, Kafka
•RDBMS integration: Sqoop
•Tabular interface for unstructured data: HCatalog
•M/R Abstraction: Pig, Hive
•SO MANY OTHERS
OpenStack: take a look at the best place to host your Big Data platform
Why does the world need OpenStack?
● Cloud is widely seen as the next-generation IT delivery model
o Agile & Flexible
o Utility-based on-demand consumption
o Self-service driving down administrative overhead and maintenance
● Public clouds are setting the benchmark of how IT could be delivered to users
o Not all organisations are ready for public cloud
● Applications are being written differently today
o More tolerant of failure
o Making use of scale-out architecture
● Our data is too large
o Volumes of data are being generated at unprecedented levels
o Most of this data is unstructured
● Service requests are too large
o More and more devices are coming online
o Tablets, phones, laptops, BYOD generation…
● Crucially, applications weren’t written to cope with the demand!
o Traditional infrastructure capabilities are being exhausted
o Service uptime, QoS, KPIs and SLAs are slipping
Major issues with traditional infrastructure…
Workloads are evolving…
Traditional workloads
● Typically each tier resides on a single machine
● Doesn’t tolerate any downtime
● Relies on underlying infrastructure for availability
● Applications scale-up, not out
Cloud-enabled workloads
● Workload resides across multiple machines
● Applications built to tolerate failure
● Does not rely on underlying infrastructure
● Applications scale-out, not up
Or an easier analogy...
PETS = TRADITIONAL WORKLOADS
● Pets are given names like lasy.internal.redhat.com
● They are unique, lovingly hand raised and cared for
● When they get ill you nurse them back to health
FARM ANIMALS = CLOUD WORKLOADS
● Farm animals have tag numbers like piggie242.redhat.com
● They are almost identical to each other
● When they get ill you get another one
OpenStack is typically suitable for the following use cases:
● A public cloud-like Infrastructure-as-a-Service cloud platform
o Internal “Infrastructure on Demand” - private cloud
o Test and Development environments - e.g. sandbox
o Cloud service provider platform - reselling compute, network & storage
● Building a scale-out platform for cloud-enabled workloads
o Web-scale applications, e.g. Netflix-like, photo/video-streaming
o Academic or pharma workloads, e.g. genetic sequencing
So, how does OpenStack fit in?
•OpenStack is made up of individual autonomous components
•All of which are designed to scale-out to accommodate throughput and availability
•OpenStack is considered more of a framework that relies on drivers and plugins
•Largely written in Python and heavily dependent on Linux
OpenStack Architecture
• Keystone provides a common authentication and authorisation store for OpenStack
• Responsible for users, their roles, and the project(s) they belong to
• Provides a catalogue of all other OpenStack services
• All OpenStack services typically rely on Keystone to verify a user’s request
OpenStack Identity Service (Keystone)
• Nova is responsible for the lifecycle of running instances within OpenStack
• Manages multiple hypervisor types via drivers, e.g.:
•Red Hat Enterprise Linux (+KVM)
•VMware vSphere
OpenStack Compute (Nova)
•Glance provides a mechanism for the storage and retrieval of disk images/templates
•Supports a wide variety of image formats, including qcow2, vmdk, ami, vhd and ova
•Many different backend storage options for images, including Swift…
OpenStack Image Service (Glance)
• Swift provides a mechanism for storing and retrieving arbitrary unstructured data
• Provides an object based interface via a RESTful/HTTP-based API
• Highly fault-tolerant with replication, self-healing, and load-balancing
• Architected to be implemented using commodity compute and storage
OpenStack Object Store (Swift)
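The object interface boils down to HTTP verbs on /v1/{account}/{container}/{object} URLs. As a minimal sketch, the snippet below only builds such a PUT request with the standard library and never sends it; the endpoint host, account, container, object name and token are all hypothetical.

```python
from urllib.parse import quote
from urllib.request import Request


def object_url(endpoint, account, container, obj):
    """Swift addresses every object as /v1/{account}/{container}/{object}."""
    return f"{endpoint}/v1/{quote(account)}/{quote(container)}/{quote(obj)}"


def put_object_request(endpoint, account, container, obj, data, token):
    """Build (but do not send) the HTTP PUT that would upload one object."""
    return Request(
        object_url(endpoint, account, container, obj),
        data=data,
        headers={"X-Auth-Token": token},  # token obtained from Keystone
        method="PUT",
    )


request = put_object_request(
    "http://swift.example.com:8080",  # hypothetical endpoint
    "AUTH_demo",                      # hypothetical account
    "bigdata",
    "logs/2015-03-27.txt",
    b"Let's reduce map reduce",
    "hypothetical-token",
)
print(request.get_method(), request.full_url)
```

Retrieving the same object would be a GET on the identical URL, which is what makes Swift easy to reach from Hadoop jobs and from plain scripts alike.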
• Neutron is responsible for providing networking to running instances within OpenStack
• Provides an API for defining, configuring, and using networks
• Relies on a plugin architecture for implementation of networks, examples include:
•Open vSwitch (default in Red Hat’s distribution)
•Cisco, PLUMgrid, VMware NSX, Arista, Mellanox, Brocade, etc.
OpenStack Networking (Neutron)
• Cinder provides block storage to instances running within OpenStack
• Used for providing persistent and/or additional storage
• Relies on a plugin/driver architecture for implementation, examples include:
• Red Hat Storage (GlusterFS), IBM XIV, HP LeftHand, 3PAR, etc.
OpenStack Volume Service (Cinder)
• Heat facilitates the creation of ‘application stacks’ made from multiple resources
• Stacks are imported as templates written in a descriptive template language
• Heat manages the automated orchestration of resources and their dependencies
• Allows for dynamic scaling of applications based on configurable metrics
OpenStack Orchestration (Heat)
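Dependency handling is the heart of orchestration: resources must be created in an order that respects what they depend on. The sketch below is not Heat's engine; it applies a topological sort (graphlib, Python 3.9+) to a hypothetical stack whose resource names are invented, while the resource types are ones Heat templates commonly use.

```python
from graphlib import TopologicalSorter

# Hypothetical stack: resource name -> (resource type, names it depends on).
STACK = {
    "private_net": ("OS::Neutron::Net", []),
    "private_subnet": ("OS::Neutron::Subnet", ["private_net"]),
    "hadoop_master": ("OS::Nova::Server", ["private_subnet"]),
    "data_volume": ("OS::Cinder::Volume", []),
    "volume_attachment": (
        "OS::Cinder::VolumeAttachment",
        ["hadoop_master", "data_volume"],
    ),
}


def creation_order(stack):
    """Return resource names ordered so every dependency comes first."""
    graph = {name: deps for name, (_type, deps) in stack.items()}
    return list(TopologicalSorter(graph).static_order())


order = creation_order(STACK)
print(order)
```

Deleting a stack would walk the same order in reverse, so nothing is removed while another resource still depends on it.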
• Ceilometer is a central collection point for metering and monitoring data
• Primarily used for chargeback of resource usage
• Ceilometer consumes data from the other components - e.g. via agents
• Architecture is completely extensible - meter what you want to - expose via API
OpenStack Telemetry (Ceilometer)
• Horizon is OpenStack’s web-based self-service portal
• Sits on top of all of the other OpenStack components via API interaction
• Provides a subset of underlying functionality
• Examples include: instance creation, network configuration, block storage attachment
• Exposes an administrative extension for basic tasks, e.g. user creation
OpenStack Dashboard (Horizon)
• All OpenStack components expose a RESTful API for communication
• A stateless, shared-nothing API service provides scalability and fault-tolerance
• Keystone manages a list of these API endpoints in its catalog
Common OpenStack Architecture
Common OpenStack Architecture
[Diagram: a “Where’s Nova?” lookup answered with http://server0:8773, and a load balancer (LB) in front of API servers server0:8773, server1:8773, server2:8773 and server3:8773]
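The catalog lookup and the load balancer can be modelled in a few lines. This is a toy sketch under one assumed reading of the diagram (the catalog returns the balanced address, and three backends sit behind it); CATALOG, where_is and route are names invented for the example, not Keystone or load balancer APIs.

```python
from itertools import cycle

# Hypothetical catalog: service type -> endpoint that clients are told to use.
CATALOG = {"compute": "http://server0:8773"}

# Hypothetical pool of stateless API servers behind the load balancer.
BACKENDS = cycle(["server1:8773", "server2:8773", "server3:8773"])


def where_is(service_type):
    """Catalog lookup: answer the question "Where's Nova?"."""
    return CATALOG[service_type]


def route(request):
    """Load balancer: hand each request to the next API server in turn."""
    return next(BACKENDS), request


endpoint = where_is("compute")
routed = [route(f"request-{i}")[0] for i in range(4)]
print(endpoint, routed)
```

Round-robin only works because the API services are stateless and share nothing: any backend can serve any request.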
• In addition to providing API services, each component has a set of workers
• These workers actually do the heavy lifting behind the scenes
• Workers (and API services) scale-out and communicate using a message bus (RabbitMQ)
• Example with Nova:
[Diagram: one Nova API service and three Nova Compute workers connected through RabbitMQ (AMQP)]
Common OpenStack Architecture
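The API-plus-workers pattern can be tried out locally with a queue and a few threads. This is only a sketch: queue.Queue stands in for RabbitMQ, threads stand in for Nova Compute services, and the message format and the names api_boot and compute_worker are invented for the example.

```python
import queue
import threading

bus = queue.Queue()      # stand-in for the RabbitMQ message bus
results = queue.Queue()  # where workers report what they did


def compute_worker(name):
    """Worker: take messages off the bus until told to stop (None)."""
    while True:
        message = bus.get()
        if message is None:
            break
        results.put((name, message["instance"]))


def api_boot(instance_name):
    """API service: accept the request and just publish a message."""
    bus.put({"action": "boot", "instance": instance_name})


workers = [
    threading.Thread(target=compute_worker, args=(f"nova-compute-{i}",))
    for i in range(3)
]
for worker in workers:
    worker.start()

for name in ["vm-a", "vm-b", "vm-c", "vm-d"]:
    api_boot(name)
for _ in workers:
    bus.put(None)  # one stop signal per worker
for worker in workers:
    worker.join()

booted = sorted(instance for _, instance in (results.get() for _ in range(4)))
print(booted)
```

Adding capacity means starting more workers on the same bus; the API side does not change.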
• OpenStack services store state information in a SQL-based database, default is MySQL
• Each service can use its own database infrastructure or share a common platform
• For resilience and throughput, replicated multi-master databases can be implemented
• Example with Keystone:
Common OpenStack Architecture
[Diagram: Keystone Server, load balancer (LB), multi-master replication using Galera]
• OpenStack services check a user’s request with Keystone for both authentication and authorisation
• Example with Nova:
Common OpenStack Architecture
[Diagram: a “Launch an Instance” request reaches the Nova API, which asks the Keystone Server: 1) Are they authenticated? 2) Are they allowed to launch an instance? The answer is Success/Fail]
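The two questions in this flow can be written down as a tiny check. The snippet is a toy model that mirrors the slide, not Keystone's API: the token table, role table, policy and function names are all hypothetical.

```python
# Hypothetical in-memory stand-ins for Keystone's token and role data.
TOKENS = {"token-alice": "alice", "token-bob": "bob"}
ROLES = {"alice": {"member"}, "bob": set()}
POLICY = {"launch_instance": "member"}  # action -> role it requires


def keystone_check(token, action):
    """Answer both questions: authenticated? and allowed to do this?"""
    user = TOKENS.get(token)
    if user is None:
        return False, "not authenticated"
    if POLICY[action] not in ROLES[user]:
        return False, "not authorised"
    return True, user


def nova_api_launch(token):
    """Nova API: check the request before doing any work."""
    ok, detail = keystone_check(token, "launch_instance")
    return "Success" if ok else f"Fail: {detail}"


print(nova_api_launch("token-alice"))
print(nova_api_launch("token-bob"))
print(nova_api_launch("bad-token"))
```

The failure reason distinguishes the two questions: an unknown token fails authentication, while a known user without the required role fails authorisation.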
OpenStack Architecture
OpenStack Sahara, or what we were supposed to talk about today
Hadoop without Sahara: the challenges
•Hadoop clusters are difficult to configure, and few have the expert knowledge to do it well
•Commodity hardware is cheap but requires frequent (costly, expert) maintenance
•Demand for data processing varies over time, even with sophisticated scheduling
•Bare-metal Hadoop cluster nodes can fail, leading to a loss of service
•Many public Big Data services don't give you flexibility
Hadoop with Sahara: beat the challenges
•OpenStack Sahara lets you:
•Deploy Hadoop clusters (predictable and repeatable)
•Scale the deployed clusters
•Define and run jobs
•Offer a programmatic API and a web console
•Furthermore:
•It supports many Hadoop distributions
•It is well integrated with other OpenStack services
•It enables the use of Hadoop even with little knowledge of it
Sahara: the project
History:
•Started at Portland Summit
•Incubated in Icehouse
•Integrated in Juno
Main components:
•Sahara REST API
•Python REST Client and Sahara Pages (Integrated with Horizon)
•Elastic Data Processing
•Provisioning Engine
•Vendor Plugins (Vanilla, Intel, Hortonworks, Cloudera, MapR)
Sahara: Architecture
Sahara: Use Cases
•Cluster Management (API V1.0)
•On-demand, scalable, persistent clusters
•Supports multiple plugins
•Integrates with Heat, Glance, Nova, Neutron, and Cinder
•EDP (Elastic Data Processing) (API V1.1)
•Supports multiple job types (Java, MR, Hive, Pig, Spark...)
•Supports transient clusters (spin up, process, shut down) or persistent clusters
•Integrates with Swift (optionally) and services on VMs
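The transient cluster idea (spin up, process, shut down) maps naturally onto a context manager. The sketch below is not the Sahara client: FakeCloud and its methods are hypothetical and only record calls, and the job name and the "SUCCEEDED" status are made up for the illustration.

```python
from contextlib import contextmanager


class FakeCloud:
    """Hypothetical stand-in for a Sahara client; it only records calls."""

    def __init__(self):
        self.events = []

    def create_cluster(self, name, plugin, workers):
        self.events.append(("create", name, plugin, workers))
        return name

    def run_job(self, cluster, job_type, job):
        self.events.append(("run", cluster, job_type, job))
        return "SUCCEEDED"

    def delete_cluster(self, cluster):
        self.events.append(("delete", cluster))


@contextmanager
def transient_cluster(cloud, name, plugin="vanilla", workers=3):
    """Spin up a cluster, hand it to the caller, always shut it down."""
    cluster = cloud.create_cluster(name, plugin, workers)
    try:
        yield cluster
    finally:
        cloud.delete_cluster(cluster)


cloud = FakeCloud()
with transient_cluster(cloud, "wordcount-cluster") as cluster:
    status = cloud.run_job(cluster, "Pig", "wordcount.pig")
print(status, [event[0] for event in cloud.events])
```

A persistent cluster would simply skip the final delete and be reused for later jobs.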
Sahara: end-user workflow
Questions?