Big Data and Containers

Charles Smith / @charles_s_smith

Netflix / Lead of the big data platform architecture team

Spend my time / Thinking about how to make it easy and efficient to work with big data

University of Florida / PhD in Computer Science

Who am I?

“It is important that we know where we come from, because if you do not know where you come from, then you don't know where you are, and if you don't know where you are, you don't know where you're going. And if you don't know where you're going, you're probably going wrong.”

Terry Pratchett

Database → Distributed Database → Distributed Storage → Distributed Processing → ???

Why do we care about containers?

Containers ~= Virtual Machines

Virtual Machines ~= Servers

Lightweight

fast to start (see the timing sketch below)

low memory use

Secure

Process isolation

Data isolation

Portable

Composable

Reproducible
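
To make "lightweight / fast to start" concrete, here is a minimal timing sketch (not from the talk): it launches a throwaway container with the Docker CLI and measures the full lifecycle. It assumes Docker is installed and the alpine image is already pulled, so it measures start-up, not image download.

import subprocess
import time

# Time a full container lifecycle: start, run a no-op, tear down.
# Assumes the Docker CLI is installed and the alpine image is
# already pulled, so we measure start-up, not image download.
start = time.monotonic()
subprocess.run(["docker", "run", "--rm", "alpine", "true"], check=True)
elapsed = time.monotonic() - start
print(f"container start + exit: {elapsed:.2f}s")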

Everything old is new

Microservices and large architectures

Data storage (Cassandra, MySQL, MongoDB, etc.)

Operational (Mesos, Kubernetes, etc.)

Discovery/Routing

What’s different about big data?

Data at rest

Data in motion

Customer Facing

Minimize latency

Maximize reliability

Data Analytics

Minimize I/O

Maximize processing

Ship computation to data
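
"Ship computation to data" is the key inversion: instead of pulling terabytes across the network to a worker, place the task on a machine that already holds a replica of its input. A minimal sketch of that placement rule follows; all block and node names are hypothetical, and this is not any real scheduler's logic.

# Locality-aware task placement sketch (hypothetical names):
# prefer a worker that already holds a replica of the task's
# input block; among those, pick the least loaded.

# Which hosts hold a replica of each input block.
replicas = {
    "block-001": {"node-a", "node-b"},
    "block-002": {"node-b", "node-c"},
}

# Current task count per worker (lower means less loaded).
load = {"node-a": 3, "node-b": 1, "node-c": 1}

def place(block_id):
    """Pick a worker for the task reading block_id."""
    local = replicas.get(block_id, set())
    # Fall back to all workers (and pay the network cost)
    # only when no local replica exists.
    candidates = local if local else load.keys()
    return min(sorted(candidates), key=lambda node: load[node])

for block in ("block-001", "block-002"):
    node = place(block)
    load[node] += 1
    print(block, "->", node)  # block-001 -> node-b, block-002 -> node-c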

The questions you can answer aren’t predefined

Hive/Pig/MR

Presto

Metacat

Hive Metastore

That doesn’t look very container-y (or microservice-y, for that matter)

Data storage - HDFS (or, in our case, S3)

Operational - YARN

Containers - JVM

So what happens when you want to do something else?

But is that really the way we want to approach containers?

What’s different about big data?

Running many different short-lived processes

Efficient container construction, allocation, and movement

Groups of processes having meaning

How we observe processes needs to be holistic

Processes need to be scheduled by data locality (and not just data locality for data at rest)

A special case of affinity (although possibly over time)

but...

We do need a data discovery service. (Kind of… maybe… a namenode?)
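
A data discovery service in this sense is essentially what a namenode already does for HDFS blocks: answer "which machines hold this data?" so processes can be scheduled next to it. A toy sketch of that interface, with hypothetical dataset, partition, and host names:

# Toy "data discovery" lookup, in the spirit of a namenode's
# block map: (dataset, partition) -> replica hosts.
# All names here are hypothetical.
from collections import defaultdict

class DataDiscovery:
    def __init__(self):
        # (dataset, partition) -> set of hosts holding a replica
        self._locations = defaultdict(set)

    def register(self, dataset, partition, host):
        """A storage node announces that it holds a replica."""
        self._locations[(dataset, partition)].add(host)

    def locate(self, dataset, partition):
        """Where can a process be scheduled to read this partition locally?"""
        return self._locations.get((dataset, partition), set())

disco = DataDiscovery()
disco.register("view_history", "dateint=20150101", "node-a")
disco.register("view_history", "dateint=20150101", "node-b")
print(disco.locate("view_history", "dateint=20150101"))  # {'node-a', 'node-b'}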

SELECT
  t.title_id,
  t.title_desc,
  SUM(v.view_secs)
FROM
  view_history AS v
  JOIN title_d AS t
    ON v.title_id = t.title_id
WHERE
  v.view_dateint > 20150101
GROUP BY 1, 2;

[Diagram: the query compiles to a DAG of stages (LOAD, LOAD → JOIN → GROUP), coordinated by services labeled Data Discovery, Query Compiler, Query Planner, Metadata, and DAG Watcher]
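
That plan is recoverable from the query itself: one LOAD per table, a JOIN on title_id, then the GROUP. A minimal sketch (stage names are hypothetical, and this is not Presto's or Hive's actual plan format) of the DAG as data, so a scheduler can launch each stage in its own container once its inputs finish:

# The example query as a stage DAG (hypothetical stage names).
# Each stage lists the stages it depends on; a scheduler can
# launch a stage, in its own container, once its inputs are done.
dag = {
    "load_view_history": [],
    "load_title_d": [],
    "join_on_title_id": ["load_view_history", "load_title_d"],
    "group_by_title": ["join_on_title_id"],
}

def runnable(done):
    """Stages whose dependencies are all complete."""
    return [s for s, deps in dag.items()
            if s not in done and all(d in done for d in deps)]

done = set()
while len(done) < len(dag):
    wave = runnable(done)
    print("launch:", wave)  # both LOADs first, then the JOIN, then the GROUP
    done.update(wave)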

Bottom line

Containers provide process level security

The goal should be to minimize monoliths

This isn’t different from what we are doing already

Our languages are abstractions of composable, distributed processing

Different big data projects should share services

No matter what we do, joining is going to be a big problem
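
Joins are hard because both sides have to meet on the join key: rows with the same key must end up in the same place, which forces a shuffle across the network no matter how the services are sliced up. A toy partitioned hash join showing that movement (the data and worker count are hypothetical):

# Why joins force data movement: rows from both tables must be
# re-partitioned by join key so matching rows land on the same
# worker (the "shuffle"). Data here is hypothetical.
from collections import defaultdict

views  = [("t1", 120), ("t2", 45), ("t1", 300)]   # (title_id, view_secs)
titles = [("t1", "Show A"), ("t2", "Show B")]     # (title_id, title_desc)

WORKERS = 2
shuffled = defaultdict(lambda: {"views": [], "titles": []})

# Shuffle phase: every row is sent to hash(key) % WORKERS.
# This is the network-heavy step that no architecture avoids.
for title_id, secs in views:
    shuffled[hash(title_id) % WORKERS]["views"].append((title_id, secs))
for title_id, desc in titles:
    shuffled[hash(title_id) % WORKERS]["titles"].append((title_id, desc))

# Join phase: each worker joins only its own partition, locally.
for worker, part in shuffled.items():
    lookup = dict(part["titles"])
    for title_id, secs in part["views"]:
        print("worker", worker, ":", title_id, lookup[title_id], secs)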

Questions?