Pachyderm: Building a Big Data Beast On Kubernetes

Pachyderm Building a Big Data Beast on Kubernetes Joe Doliner Founder & CEO [email protected]

Transcript of Pachyderm: Building a Big Data Beast On Kubernetes

Pachyderm

Building a Big Data Beast on Kubernetes

Joe Doliner

Founder & CEO

[email protected]

About me

The origin story

Wanted to analyze chess games with Hadoop

Let’s build a modern Hadoop!

Oh shit!

First I need to build 15 years of distributed systems…

Distributed systems are hard

Bet it all on the container ecosystem

Pachyderm’s Architecture

[Diagram labels: Kubernetes, User Analysis, Pachyderm Pipeline System, Services, Jobs, Pachyderm File System, User Data]

Pachyderm File System

A copy-on-write distributed file system

Copy-on-write is the paradigm that “powers” technologies like Docker and Spark

Core storage for Pachyderm

Why is this cool?

• View diffs
• Instant Revert
• Reduce storage needs
• Reliability

[Diagram: Commit 0, Commit 1, Commit 2, Commit 3, Commit 4]

Git for huge datasets
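The copy-on-write idea behind view diffs, instant revert, and reduced storage can be sketched in a few lines. This is a toy illustration, not Pachyderm's implementation: `ToyRepo` and all of its methods are hypothetical names, and everything lives in memory. Each commit is only a map from path to content hash, so unchanged data is stored once and shared between commits.

```python
import hashlib


class ToyRepo:
    """Toy copy-on-write repo (hypothetical, not the PFS API)."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes, each blob stored once
        self.commits = []  # commit i is a dict: path -> content hash

    def commit(self, changes):
        """Create a commit from the parent commit plus `changes` (path -> bytes)."""
        # Only the small path -> hash map is copied, never the file data.
        snapshot = dict(self.commits[-1]) if self.commits else {}
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)
            snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def diff(self, old, new):
        """Paths that were added or changed between two commits."""
        a, b = self.commits[old], self.commits[new]
        return sorted(p for p in b if a.get(p) != b[p])

    def read(self, commit_id, path):
        return self.blobs[self.commits[commit_id][path]]

    def revert(self, commit_id):
        """Revert by re-committing an old snapshot; no blobs are copied."""
        self.commits.append(dict(self.commits[commit_id]))
        return len(self.commits) - 1


repo = ToyRepo()
c0 = repo.commit({"games/a.pgn": b"e4 e5", "games/b.pgn": b"d4 d5"})
c1 = repo.commit({"games/b.pgn": b"d4 Nf6"})
changed = repo.diff(c0, c1)
c2 = repo.revert(c0)
```

Here `changed` lists only `games/b.pgn`, and the revert adds a new commit without storing any new data.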

Pachyderm Pipeline System

• Runs k8s jobs over PFS
• Jobs triggered by commits
• Understands job dependencies
• Leverages copy-on-write storage

[Diagram: pipeline graph of Task1, Task2, Task3, Task4, Task5, Task6, and a Dashboard]

Data-aware container scheduler
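A rough sketch of what "jobs triggered by commits" plus "understands job dependencies" could look like: given a dependency graph and the input that received a commit, work out which tasks need to run and in what order. The graph below, the input names `raw_data` and `config`, and the function `jobs_for_commit` are all assumptions made for illustration; the edges in the slide's diagram did not survive, and this is not how PPS is implemented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical graph: each task maps to the set of inputs it reads.
deps = {
    "task1": {"raw_data"},
    "task2": {"task1"},
    "task3": {"task1", "config"},
    "task4": {"task2", "task3"},
}


def jobs_for_commit(deps, changed_input):
    """Tasks to run, in dependency order, after a commit to `changed_input`."""
    order = TopologicalSorter(deps).static_order()  # inputs come before readers
    dirty = {changed_input}
    to_run = []
    for node in order:
        # A task is affected if any of its inputs is new or was recomputed.
        if deps.get(node, set()) & dirty:
            dirty.add(node)
            to_run.append(node)
    return to_run


after_data_commit = jobs_for_commit(deps, "raw_data")
after_config_commit = jobs_for_commit(deps, "config")
```

A commit to `raw_data` schedules all four tasks, while a commit to `config` schedules only `task3` and then `task4`; `task1` and `task2` are left alone.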

Pachyderm is…

[Diagram: pipeline graph of Task1, Task2, Task3, Task4, Task5, Task6, and a Dashboard]

$ Task2 failed
$ Task4 and 6 waiting…

…Fixing code…

$ Task2 resuming...
$ Task2 complete
$ Task4 starting…

Monitoring

Resilient: K8s jobs can be restarted
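The fail, fix, resume sequence in the console output can be mimicked with a small runner that remembers which tasks finished. This is a minimal sketch under stated assumptions: `run_pipeline`, the linear chain of dependencies, and the in-memory `done` set are hypothetical and stand in for whatever state the real system keeps.

```python
def run_pipeline(order, deps, actions, done):
    """Run tasks in `order`, skipping finished ones; return log lines."""
    log = []
    for task in order:
        if task in done:
            continue  # finished in an earlier run, not recomputed
        if not deps.get(task, set()) <= done:
            log.append(f"{task} waiting")  # an upstream task has not finished
            continue
        try:
            actions[task]()
        except Exception:
            log.append(f"{task} failed")
            continue
        done.add(task)
        log.append(f"{task} complete")
    return log


# Hypothetical chain: task4 and task6 sit downstream of task2.
order = ["task1", "task2", "task4", "task6"]
deps = {"task2": {"task1"}, "task4": {"task2"}, "task6": {"task4"}}
code_is_fixed = {"task2": False}


def task2():
    if not code_is_fixed["task2"]:
        raise RuntimeError("bug in task2")


actions = {
    "task1": lambda: None,
    "task2": task2,
    "task4": lambda: None,
    "task6": lambda: None,
}
done = set()
first_run = run_pipeline(order, deps, actions, done)
code_is_fixed["task2"] = True  # "...Fixing code..."
second_run = run_pipeline(order, deps, actions, done)
```

The second run starts at `task2`; `task1` is not repeated because its result was already recorded.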

Efficient: incremental processing

[Diagram: "Data" (commits 0, 1, 2, 3) and "Analysis" (Task1, Task2, Task3, Task4, Task5, Task6, Dashboard); a second panel labeled "1% more data" shows Task4, Task6, and the Dashboard]
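One way to picture incremental processing: cache per-file results by content hash, so that a commit adding 1% more data triggers work on only the new 1%. The sketch below is an assumption-laden illustration; `process_incrementally`, the cache layout, and the sample data are hypothetical and do not describe Pachyderm's actual mechanism.

```python
import hashlib


def process_incrementally(files, cache, process):
    """Apply `process` to each file, reusing cached results for seen content.

    Returns (results by path, number of files actually processed).
    """
    results = {}
    processed = 0
    for path, data in files.items():
        key = hashlib.sha256(data).hexdigest()
        if key not in cache:
            cache[key] = process(data)  # only new or changed content is computed
            processed += 1
        results[path] = cache[key]
    return results, processed


# Hypothetical data: 100 game files, then one more arrives in a new commit.
files = {f"games/{i}.pgn": f"game {i}".encode() for i in range(100)}
cache = {}
_, first_count = process_incrementally(files, cache, len)

files["games/100.pgn"] = b"game 100"
results, second_count = process_incrementally(files, cache, len)
```

The first pass processes all 100 files; the second pass processes just the one new file and still returns results for all 101.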

Pachyderm is…

[Diagram labels: PFS storage nodes, PPS, Copy-on-write storage nodes, Elastically scaling computation nodes, d2.8xlarge, PPS (three nodes), Spot (three nodes)]

Cost-effective: resource management

Pachyderm is…

Summary

Kubernetes is a game-changer for distributed systems

Copy-on-write data is really powerful

Pachyderm unlocks the power of Kubernetes for big data

Thank You!

Questions?

pachyderm.io

[email protected]

github.com/pachyderm/pachyderm