Real time-hadoop

Real-time Hadoop: The Ideal Messaging System for Hadoop
Ted Dunning

Contact Information

Ted Dunning
Chief Applications Architect at MapR Technologies

Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation

Email tdunning@apache.org tdunning@maprtech.com

Twitter @ted_dunning

Hashtags today: #stratahadoop #ojai

Don’t Miss These
• Just-in-time optimizing a database
– Me! at 4:20 PM, Room 230 C, today
• Why flow instead of state?
– Me! at 5:10 PM, Room 210 D/H, today
• High Frequency Decisioning
– Jack Norris! at 11:00 PM, Room 210 B/F, tomorrow
• Threat detection on streaming data
– Carol Macdonald! at 3:45 PM, Solutions Theater, tomorrow
• Scaling Your Business … Zeta Architecture
– Jim Scott! at 5:10 PM, Room 210 D/H, tomorrow

And Also, a Little Fun

Come jam with us

The Big Data Boys and the Real-time Stream Band
5:50 PM, MapR booth, today

Goals
• Real-time or near-time
– Includes situations with deadlines
– Also includes situations where delay is simply undesirable
– Even includes situations where delay is just fine
• Micro-services
– Streaming is a convenient idiom for design
– Micro-services … you know we wanted it
– Service isolation is a key requirement

Real-time or Near-time?
• The real point is flow versus state (see talk later today)

• One consequence of flow-based computing is real-time and near-time become relatively easy

• Life may be a bitch, but it doesn’t happen in batches!

Agenda
• Background / micro-services

• Global requirements

• Scale

A microservice is

loosely coupled with bounded context

How to Couple Services and Break micro-ness
• Shared schemas, relational stores
• Ad hoc communication between services
• Enterprise service busses
• Brittle protocols
• Poor protocol versioning

Don’t do this!

How to Decouple Services
• Use self-describing data
• Private databases
• Infrastructural communication between services
• Use modern protocols
• Adopt future-proof protocol practices

• Use shared storage where necessary due to scale

What is the Right Structure for Flow Compute?
• Traditional message queues?
– Message queues are classic answer
– Key feature/bug is out-of-order acknowledgement
– Many implementations
– You pay a huge performance hit for persistence
• Kafka-esque Logs?
– Logs are like queues, but with ordering
– Out of order consumption is possible, acknowledgement not so much
– Canonical base implementation is Kafka
– Performance plus persistence
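The contrast between the two structures can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the class names `AckQueue` and `Log` and the consumer name `billing` are hypothetical and do not come from any real broker or from Kafka itself.

```python
class AckQueue:
    """Toy message queue: each message is acknowledged individually, in any order."""

    def __init__(self):
        self._pending = {}   # message id -> payload, still awaiting acknowledgement
        self._next_id = 0

    def send(self, payload):
        msg_id = self._next_id
        self._pending[msg_id] = payload
        self._next_id += 1
        return msg_id

    def ack(self, msg_id):
        # Out-of-order acknowledgement: the queue must track every message separately.
        del self._pending[msg_id]

    def unacknowledged(self):
        return sorted(self._pending)


class Log:
    """Toy Kafka-style log: ordered, retained after reads, one offset per consumer."""

    def __init__(self):
        self._records = []    # append-only; reading never removes anything
        self._committed = {}  # consumer name -> next offset to read

    def append(self, payload):
        self._records.append(payload)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Any offset can be read, so out-of-order consumption (or replay) is possible.
        return self._records[offset:offset + max_records]

    def poll(self, consumer, max_records=10):
        return self.read(self._committed.get(consumer, 0), max_records)

    def commit(self, consumer, offset):
        # A single number acknowledges everything before it.
        self._committed[consumer] = offset


queue = AckQueue()
ids = [queue.send(p) for p in ("a", "b", "c")]
queue.ack(ids[2])                 # acknowledge the last message first
queue.ack(ids[0])
queue_left = queue.unacknowledged()

log = Log()
for p in ("a", "b", "c"):
    log.append(p)
first_batch = log.poll("billing", max_records=2)
log.commit("billing", 2)          # everything below offset 2 is done
second_batch = log.poll("billing")
replay = log.read(0)              # the full history is still there
```

The queue has to remember which individual messages are outstanding, while the log only stores one offset per consumer and keeps the records available for replay.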

Scenarios: Profile Database

The task

Traditional Solution

What Happens Next?

How to Get Service Isolation

New Uses of Data

Scaling Through Isolation

Lessons
• De-coupling and isolation are key
• Private data stores/tables are important,
– but local storage of private data is a bug
• Propagate events, not table updates
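A minimal sketch of "propagate events, not table updates", assuming a toy in-memory stream. The names `EventStream`, `ProfileService`, `ActivityCounter` and the event fields are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict


class EventStream:
    """Toy stand-in for a persistent topic: an append-only list of events."""

    def __init__(self):
        self.events = []

    def publish(self, event):
        self.events.append(event)


class ProfileService:
    """Keeps its own private table, built only from the events it reads."""

    def __init__(self):
        self.profiles = {}
        self.position = 0   # how far into the stream this service has read

    def catch_up(self, stream):
        for event in stream.events[self.position:]:
            if event["type"] == "profile_updated":
                self.profiles.setdefault(event["user"], {}).update(event["fields"])
        self.position = len(stream.events)


class ActivityCounter:
    """A later, independent use of the same events; no change to the producer."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.position = 0

    def catch_up(self, stream):
        for event in stream.events[self.position:]:
            self.counts[event["user"]] += 1
        self.position = len(stream.events)


stream = EventStream()
stream.publish({"type": "profile_updated", "user": "u1", "fields": {"city": "Oslo"}})
stream.publish({"type": "login", "user": "u1"})
stream.publish({"type": "profile_updated", "user": "u1", "fields": {"plan": "pro"}})

profiles = ProfileService()
profiles.catch_up(stream)

counter = ActivityCounter()   # added later, replays the same history
counter.catch_up(stream)
```

Each consumer owns its table and its read position, so a new use of the data can be added without touching the producer or the other consumers.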

Scenarios: IoT Data Aggregation

Basic Situation

Each location has many

Multiple locations

What Does a Pump Look Like

Temperature, Pressure

Temperature, Pressure, Flow

Winding temperature

Voltage, Current


Basic Architecture Reflects Business Structure

Lessons
• Data architecture should reflect business structure

• Even very modest designs involve multiple data centers

• Schemas cannot be frozen in the real world

• Security must follow data ownership
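The point about schemas that cannot be frozen can be illustrated with self-describing records. The sketch below assumes pump readings arrive as JSON; the measurement names follow the pump slide, while `pump_id`, the field spellings and the `summarize` helper are hypothetical.

```python
import json

# Fields this particular consumer understands today.
KNOWN_FIELDS = ("temperature", "pressure", "flow")


def summarize(raw_message):
    """Read a self-describing record, ignoring fields this consumer does not know."""
    record = json.loads(raw_message)
    readings = {name: record[name] for name in KNOWN_FIELDS if name in record}
    return record.get("pump_id"), readings


# An older sensor package reports only two measurements.
old_message = json.dumps({"pump_id": "p-17", "temperature": 71.5, "pressure": 3.2})

# A newer package adds flow and motor measurements; the reader keeps working.
new_message = json.dumps({
    "pump_id": "p-17",
    "temperature": 72.0,
    "pressure": 3.1,
    "flow": 40.0,
    "winding_temperature": 88.0,
    "voltage": 480.0,
})

old_summary = summarize(old_message)
new_summary = summarize(new_message)
```

Because each record names its own fields, producers can add measurements without coordinating a schema change with every consumer.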

Scenarios: Global Data Recovery

Lessons
• Arbitrary number of topics important for simplicity + performance

• Updates happen in many places

• Mobility implies change in replication patterns

• Multi-master updates simplify design massively

Converged Requirements

What Have We Learned?
• Need persistence and performance
– Possibly for years and to 100’s of millions t/s
• Must have convergence
– Need files, tables AND streams
– Need volumes, snapshots, mirrors, permissions and …
• Must have platform security
– Cannot depend on perimeter
– Must follow business structure
• Must have global scale and scope
– Millions of topics for natural designs
– Multi-master replication and update

The Importance of Common APIs
• Commonality and interoperability are critical
– Compare Hadoop eco-system and the NoSQL world
• Table stakes
– Persistence
– Performance
– Polymorphism
• Major trend so far is to adopt Kafka API
– 0.9 API and beyond remove major abstraction leaks
– Kafka API supported by all major Hadoop vendors

What we do

Evolution of Data Storage

[Figure, built up over four slides. Axis labels: Functionality / Compatibility, and Scalability. Points shown: Linux / POSIX, Hadoop.]

Over decades of progress, Unix-based systems have set the standard for compatibility and functionality

Hadoop achieves much higher scalability by trading away essentially all of this compatibility

MapR enhanced Apache Hadoop by restoring the compatibility while increasing scalability and performance

Adding tables and streams enhances the functionality of the base file system

http://bit.ly/fastest-big-data

How we do this with MapR
• MapR Streams is a C++ reimplementation of Kafka API
– Advantages in predictability, performance, scale
– Common security and permissions with entire MapR converged data platform
• Semantic extensions
– A cluster contains volumes, files, tables … and now streams
– Streams contain topics
– Can have default stream or can name stream by path name
• Core MapR capabilities preserved
– Consistent snapshots, mirrors, multi-master replication
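A hedged sketch of what "default stream or stream by path name" could look like to a client. It assumes the `/path/to/stream:topic` spelling for a fully qualified topic name, which the slides do not spell out; the helper `resolve_topic` and the example paths are hypothetical.

```python
def resolve_topic(name, default_stream=None):
    """Split a topic name into (stream path, topic).

    Assumption: a name starting with "/" is written as "<stream path>:<topic>";
    any other name is a bare topic that lives in the default stream.
    """
    if name.startswith("/"):
        stream_path, separator, topic = name.rpartition(":")
        if not separator or not stream_path or not topic:
            raise ValueError(f"expected '/stream/path:topic', got {name!r}")
        return stream_path, topic
    if default_stream is None:
        raise ValueError(f"bare topic {name!r} needs a default stream")
    return default_stream, name


by_path = resolve_topic("/apps/pumps/site-1:pressure")
by_default = resolve_topic("pressure", default_stream="/apps/pumps/site-1")
```

Since streams sit in the same namespace as files and tables, the stream path can carry the same permissions and volume placement as any other path in the cluster.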

MapR core Innovations
• Volumes
– Distributed management
– Data placement
• Read/write random access file system
– Allows distributed meta-data
– Improved scaling
– Enables NFS access
• Application-level NIC bonding
• Transactionally correct snapshots and mirrors

MapR's Containers

Each container contains
– Directories & files
– Data blocks

Replicated on servers

No need to manage directly

Files/directories are sharded into blocks, which are placed into containers on disks

Containers are 16-32 GB segments of disk, placed on nodes

MapR's Containers

Each container has a replication chain

Updates are transactional

Failures are handled by rearranging replication

Container locations and replication

[Figure: containers labeled with the nodes that host them: N1, N2 · N3, N2 · N1, N2 · N1, N3 · N3, N2 · N3]

Container location database (CLDB) keeps track of nodes hosting each container and replication chain order
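A toy model of the bookkeeping described here, as a sketch only. The slides do not describe how the CLDB actually repairs a chain, so the `ToyCLDB` class, its methods and the repair policy (drop the failed node, append a live node not already in the chain) are all assumptions made for illustration.

```python
class ToyCLDB:
    """Maps each container id to its replication chain (an ordered list of nodes)."""

    def __init__(self, chains):
        self.chains = {cid: list(nodes) for cid, nodes in chains.items()}

    def locate(self, container_id):
        # Return a copy so callers cannot change the chain order by accident.
        return list(self.chains[container_id])

    def node_failed(self, failed, live_nodes):
        # Assumed policy: remove the failed node from every chain that used it,
        # then append the first live node that is not already in that chain.
        for chain in self.chains.values():
            if failed in chain:
                chain.remove(failed)
                replacement = next((n for n in live_nodes if n not in chain), None)
                if replacement is not None:
                    chain.append(replacement)


cldb = ToyCLDB({1: ["N1", "N2"], 2: ["N3", "N2"], 3: ["N1", "N3"]})
before = cldb.locate(1)
cldb.node_failed("N1", live_nodes=["N2", "N3"])
after = {cid: cldb.locate(cid) for cid in (1, 2, 3)}
```

In this sketch the surviving replica moves to the head of the chain and the new replica joins at the tail, which is one way of "rearranging replication" after a failure.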

MapR Scaling
• Containers represent 16 - 32GB of data
– Each can hold up to 1 Billion files and directories
– 100M containers = ~ 2 Exabytes (a very large cluster)
• 250 bytes DRAM to cache a container
– 25GB to cache all containers for 2EB cluster
– But not necessary, can page to disk
– Typical large 10PB cluster needs 2GB
• Container-reports are 100x - 1000x < HDFS block-reports
– Serve 100x more data-nodes
– Increase container size to 64G to serve 4EB cluster
– Map/reduce not affected
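The two headline numbers can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 GB = 10^9 bytes, 1 EB = 10^18 bytes); the variable names are only for illustration.

```python
GB = 10**9
EB = 10**18

containers = 100_000_000          # 100M containers
bytes_per_cache_entry = 250       # DRAM needed to cache one container

# Capacity range for 16 GB and 32 GB containers, in exabytes.
capacity_low_eb = containers * 16 * GB / EB
capacity_high_eb = containers * 32 * GB / EB

# DRAM needed to cache every container's location, in gigabytes.
cache_gb = containers * bytes_per_cache_entry / GB
```

With these assumptions the capacity comes out between 1.6 and 3.2 EB, consistent with the "~ 2 Exabytes" figure, and the cache comes out at 25 GB.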

But Wait, There’s More
• Directories and files are implemented in terms of B-trees
– Key is offset, value is data blob
– Internal transactional semantics guarantees safety and consistency
– Layout algorithms give very high layout linearization
• Tables are implemented in terms of B-trees
– Twisted B-tree implementation allows virtues of log-structured merge tree without the compaction delays
– Tablet splitting without pausing, integration with file system transactions
• Common security and permissions scheme

And More …
• Streams are implemented in terms of B-trees as well
– Topics and consumer offsets are kept in stream, not ZK
– Similar splitting technology as MapR DB tables
– Consistent permissions, security, data replication
• Standard Kafka 0.9 API
• Plans to add OJAI for high-level structuring
• Performance is very high

Example

Lessons
• APIs matter more than implementations

• There is plenty of room to innovate ahead of the community

• POSIX, HDFS, HBase all define useful APIs

• Kafka 0.9+ does the same

Call to action:

Support the Kafka APIs


And come by the MapR booth to check out MapR Streams

Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 - 2016
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR

http://bit.ly/ebook-real-world-hadoop

http://bit.ly/mapr-tsdb-ebook

http://bit.ly/ebook-anomaly

http://bit.ly/recommendation-ebook

Free copies at book signing today

http://bit.ly/mapr-ebook-streams

Thank You!

Q & A

@mapr maprtech

tdunning@maprtech.com

Engage with us!

maprtech

mapr-technologies
