Evolving a data-driven company from MapReduce to Spark - Ferran Galí Reniu @ PAPIs Connect

Transcript of Evolving a data-driven company from MapReduce to Spark - Ferran Galí Reniu @ PAPIs Connect

Evolving a data-driven company from MapReduce to Spark

Ferran Galí i Reniu

@ferrangali

About me

@ferrangali

Classified Ads

Searching for a home

Searching for second hand cars

Searching for job offers

Trovit

Vertical Search Engine

(Almost) Everywhere

Lots of data!

Ads, Behaviors, Saved data

Apache Hadoop

Vertical vs Horizontal scalability

The Big Data problem: Hadoop

Processing: Application, Job

Resource Manager: YARN

Storage: HDFS - Hadoop Distributed File System

Hardware: Node, Node, Node, Node, Node, Node, Node, Node

Making the business flow

Business Intelligence

Search engine

Mailing, Push Notifications

Online Media Buying

Added value to the site

And even pretty maps!

MapReduce API is not flexible at all!

(Diagram: a pipeline built from chained Map and Reduce steps)
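
The rigidity the slide points at can be sketched in plain Python. This is a minimal sketch of the map, shuffle, reduce model and not Hadoop's actual API: `run_mapreduce`, the sample click log, and every mapper and reducer name below are hypothetical. The point is that each question has to be phrased as one map function plus one reduce function, and a second question means a second full job.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """One job: map every record, group values by key (shuffle), reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Hypothetical click log: (user, ad category)
clicks = [("ann", "cars"), ("bob", "homes"), ("ann", "homes"),
          ("cid", "cars"), ("ann", "cars")]


# Job 1: count clicks per category
def map_category(record):
    _user, category = record
    yield category, 1


def reduce_sum(key, values):
    yield key, sum(values)


# Job 2: list the categories that share each click count
def map_invert(record):
    category, count = record
    yield count, category


def reduce_collect(key, values):
    yield key, sorted(values)


# The second job can only start from the finished output of the first one.
clicks_per_category = run_mapreduce(clicks, map_category, reduce_sum)
categories_by_count = run_mapreduce(clicks_per_category, map_invert, reduce_collect)

print(clicks_per_category)
print(categories_by_count)
```

Even this small two-step question needs two complete map and reduce pairs, which is the pattern the chained diagram illustrates.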

MapReduce is disk intensive!

(Diagram: the same chain of Map and Reduce steps)

Apache Spark

Spark Core API: RDDs

filter() {...}

groupByKey() {...}

join() {...}

collect()

write()

count()

...
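
How operations such as filter(), join() and groupByKey() can be chained freely and only run when collect() or count() is called can be sketched with a toy class. `ToyRDD` is a hypothetical stand-in written in plain Python, not Spark's implementation, and the sample ads and clicks are made up. In Spark the corresponding RDD methods are `filter`, `join`, `groupByKey`, `collect` and `count`.

```python
from collections import defaultdict


class ToyRDD:
    """Toy stand-in for an RDD: it stores a recipe, not data (hypothetical class)."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument function returning an iterator

    @classmethod
    def from_list(cls, items):
        return cls(lambda: iter(items))

    # Transformations: each returns a new ToyRDD and runs nothing yet.
    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._compute() if predicate(x)))

    def join(self, other):
        def compute():
            right = defaultdict(list)
            for key, value in other._compute():
                right[key].append(value)
            return ((key, (left_value, right_value))
                    for key, left_value in self._compute()
                    for right_value in right[key])
        return ToyRDD(compute)

    def group_by_key(self):
        def compute():
            groups = defaultdict(list)
            for key, value in self._compute():
                groups[key].append(value)
            return iter(sorted(groups.items()))
        return ToyRDD(compute)

    # Actions: these trigger the whole chain.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


ads = ToyRDD.from_list([(1, "car"), (2, "home"), (3, "car")])  # (ad id, category)
clicks = ToyRDD.from_list([(1, "ann"), (3, "bob"), (1, "cid"), (2, "ann")])  # (ad id, user)

calls = []  # records every element the predicate actually sees


def is_car(pair):
    calls.append(pair)
    return pair[1] == "car"


pipeline = ads.filter(is_car).join(clicks).group_by_key()
calls_before_action = len(calls)  # still 0: only the recipe exists so far
result = pipeline.collect()       # the action runs filter, join and grouping
print(calls_before_action, result)
```

Building `pipeline` touches no data; only the call to `collect()` evaluates the predicate, which is the split between the operations with `{...}` bodies and the final calls on the slide.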

Spark API is more flexible!

(Diagram: Map and Reduce steps combined freely in one pipeline)

Spark can do it in-memory!

(Diagram: the same pipeline of Map and Reduce steps)
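
The difference between passing every intermediate result through disk and keeping it in memory can be sketched with the standard library alone. This is a deliberate simplification and not how Hadoop or Spark are implemented: the three steps, the JSON files and the function names are all hypothetical, and a real Spark job may still write to disk in some situations.

```python
import json
import os
import tempfile


def double(values):
    return [v * 2 for v in values]


def keep_multiples_of_four(values):
    return [v for v in values if v % 4 == 0]


def total(values):
    return [sum(values)]


STEPS = [double, keep_multiples_of_four, total]


def run_via_disk(data, steps, workdir):
    """Chained-jobs style: every step reads a file and writes its output to a new file."""
    path = os.path.join(workdir, "input.json")
    with open(path, "w") as handle:
        json.dump(data, handle)
    step_outputs_written = 0
    for index, step in enumerate(steps):
        with open(path) as handle:
            current = json.load(handle)
        path = os.path.join(workdir, f"step{index}.json")
        with open(path, "w") as handle:
            json.dump(step(current), handle)
        step_outputs_written += 1
    with open(path) as handle:
        return json.load(handle), step_outputs_written


def run_in_memory(data, steps):
    """In-memory style: each step hands its result straight to the next one."""
    current = data
    for step in steps:
        current = step(current)
    return current


data = list(range(1, 6))
with tempfile.TemporaryDirectory() as workdir:
    disk_result, writes = run_via_disk(data, STEPS, workdir)
memory_result = run_in_memory(data, STEPS)
print(disk_result, writes, memory_result)
```

Both runs give the same answer; the disk version pays one write and one read per step, which is the cost the "disk intensive" slide refers to.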

The Big Data problem: Spark stack

Spark SQL, Machine Learning Library, Streaming, GraphX

Spark Core (RDDs)

The Big Data problem: Spark stack (with DataFrames)

Spark SQL, Machine Learning Library, Streaming, GraphX

Spark DataFrames API

Spark Core (RDDs)

The Big Data problem: Easy integration with Hadoop

Processing: Application, Job, Job

Resource Manager: YARN

Storage: HDFS - Hadoop Distributed File System

Hardware: Node, Node, Node, Node, Node, Node, Node, Node

Hands-on

Questions?

Icons made by Freepik from Flaticon is licensed by CC BY 3.0