Benefits of Hadoop as Platform as a Service

Dublin, 14 April 2016 Benefits of Hadoop as Platform as a Service Aaron Call Barcelona Supercomputing Center www.bsc.es

Transcript of Benefits of Hadoop as Platform as a Service

Page 1: Benefits of Hadoop as Platform as a Service

Dublin, 14 April 2016

Benefits of Hadoop as Platform as a Service

Aaron Call

Barcelona Supercomputing Center

www.bsc.es

Page 2: Benefits of Hadoop as Platform as a Service

Barcelona Supercomputing Center

Page 3: Benefits of Hadoop as Platform as a Service

BSC – Barcelona Supercomputing Center

23 years of research on computer architecture

• European Center for Parallelism of Barcelona (CEPBA)

• Based at the Polytechnical University of Catalonia (UPC)

Led by Mateo Valero

• Seymour Cray Award 2015, first European to win it

• ACM Fellow, Eckert-Mauchly Award in 2007, Google award in 2009

Large research staff

• 1000+ publications

Page 4: Benefits of Hadoop as Platform as a Service

BSC – Barcelona Supercomputing Center

Many life sciences computational projects

• Computational Genomics

• Molecular modeling and bioinformatics

• Protein interactions and docking

• In place computational capabilities

• MareNostrum supercomputer

Research activity around Hadoop since 2008

• Data-centric research group:

http://www.bsc.es/computer-sciences/data-centric-computing

• SLA-driven scheduling (adaptive scheduler)

• Project ALOJA

Page 5: Benefits of Hadoop as Platform as a Service

ALOJA

Page 6: Benefits of Hadoop as Platform as a Service

Automated characterization of the cost-effectiveness of Big Data deployments

Seeks to provide knowledge and tools that help users reduce the total cost of ownership (TCO) of their infrastructures

About the project

Page 7: Benefits of Hadoop as Platform as a Service

What is the most effective configuration for my needs?

About the project

Page 8: Benefits of Hadoop as Platform as a Service

In ALOJA we have gathered extensive knowledge about the behavior of on-premise and IaaS Hadoop deployments

60k+ runs

Public repository

Page 9: Benefits of Hadoop as Platform as a Service

What is best for one workload is not best for all

Lessons learnt from IaaS

[Figure: Disks and network impact. Series: HDD-IB, SSD-ETH, HDD-ETH, SSD-IB]

[Figure: Local vs remote disks. Series: Local only; 1, 2 and 3 Remotes; 1, 2 and 3 Remotes with /tmp local]

Page 10: Benefits of Hadoop as Platform as a Service

PaaS Advantages

Page 11: Benefits of Hadoop as Platform as a Service

Provides an automated setup of Big Data services (Hadoop, Spark, Hive, ...)

• Optimized for the underlying hardware

• Removes cost of installation

The service provider is in charge of maintenance

• Reduces TCO

• As with any cloud service, you pay as you go

Platform as a Service

Page 12: Benefits of Hadoop as Platform as a Service

O'Reilly conducted a survey on data science salaries and estimated an average salary of 140,000 US$ for a data engineer

A cluster of 16 datanodes on HDInsight, using A3 machines, costs for one year:

• (16 datanodes + 2 headnodes) * 0.2384 US$/hr = 4.2912 US$/hr => 4.2912 * 24 * 365 = 37,590.912 US$/year

Hence, under ideal conditions, we can save up to 140,000 - 37,590.912 = 102,409.088 US$ per year
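
The arithmetic on this slide can be reproduced with a short sketch. The figures are the ones quoted above; the constant and function names are illustrative only, and the comparison against a single salary is the slide's own idealized assumption.

```python
# Figures quoted on the slide; names are illustrative, not from any tool.
DATANODES = 16
HEADNODES = 2
PRICE_PER_NODE_HOUR = 0.2384   # US$/hr per A3 node, as quoted
ENGINEER_SALARY = 140_000      # US$/year, survey estimate quoted above
HOURS_PER_YEAR = 24 * 365


def yearly_cluster_cost(nodes: int, price_per_hour: float) -> float:
    """Cost in US$ of keeping `nodes` machines running for a full year."""
    return nodes * price_per_hour * HOURS_PER_YEAR


cluster_cost = yearly_cluster_cost(DATANODES + HEADNODES, PRICE_PER_NODE_HOUR)
savings = ENGINEER_SALARY - cluster_cost

print(f"Cluster cost:       {cluster_cost:,.3f} US$/year")
print(f"Salary minus cost:  {savings:,.3f} US$/year")
```

Changing the node count or hourly price shows how quickly the margin shrinks for larger clusters.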

How much spent on maintenance?

Page 13: Benefits of Hadoop as Platform as a Service

Some current solutions

• Azure HDInsight

• Rackspace CBD

• Amazon EMR

• Google Cloud Platform

Platform as a Service

Page 14: Benefits of Hadoop as Platform as a Service

Linux-based clusters of 4, 8 and 16 datanodes

• Azure HDInsight and Rackspace CBD

• Azure IaaS and Rackspace IaaS clusters as well

Clusters of up to 8 cores per node and 64 GB RAM

HDInsight: Azure Storage as HDFS (remote disks)

Rackspace CBD: nodes’ local disks as HDFS

Evaluation environment

Page 15: Benefits of Hadoop as Platform as a Service

Wordcount

• CPU intensive: useful to analyze scalability of the nodes between VM sizes

Tested workloads

[Figure: CPU usage breakdown. Series: %user, %system, %steal, %iowait, %nice]

Page 16: Benefits of Hadoop as Platform as a Service

Terasort

• Combined I/O and CPU loads, a de facto benchmark in the community

Tested workloads

Data sizes of 1, 10, 100 and 1000 GB

This is enough to stress the system and observe its overall behavior

[Figure: CPU usage breakdown. Series: %user, %system, %steal, %iowait, %nice]

Page 17: Benefits of Hadoop as Platform as a Service

Runs repeated several times

Cloud variability (100GB runs)

Benchmark    Provider         Standard deviation (%)
Terasort     HDInsight        60
             Rackspace CBD    28
Wordcount    HDInsight        55
             Rackspace CBD    47
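
The slide does not state how the percentage was computed. A minimal sketch, assuming it is the sample standard deviation of the execution times expressed relative to their mean; the run times below are hypothetical, not measurements from the talk.

```python
from statistics import mean, stdev


def relative_std_dev(times_s):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * stdev(times_s) / mean(times_s)


# Hypothetical execution times (seconds) of one benchmark repeated five times.
runs = [100.0, 120.0, 80.0, 110.0, 90.0]
print(f"Relative standard deviation: {relative_std_dev(runs):.1f}%")
```

Expressing the spread relative to the mean is what makes the figures comparable across benchmarks with very different run times.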

Page 18: Benefits of Hadoop as Platform as a Service

Relevant factors tree

ALOJA-ML is a set of machine learning techniques and tools to estimate the behavior of executions in the unexplored search space

Relevant factors tree: a tool that identifies the parameters that most change an execution's behavior
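
To illustrate the idea of ranking factors by their effect on execution time, here is a simplified sketch. It is not ALOJA-ML's implementation: it only scores each factor by how much grouping the runs on it reduces the variance of the execution time, which corresponds to choosing the first split of a regression tree. The runs, factor names and times are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical runs: configuration factors plus execution time in seconds.
RUNS = [
    {"benchmark": "terasort", "datanodes": 4, "io_file_buffer": 65536, "time": 950.0},
    {"benchmark": "terasort", "datanodes": 4, "io_file_buffer": 131072, "time": 550.0},
    {"benchmark": "terasort", "datanodes": 8, "io_file_buffer": 65536, "time": 900.0},
    {"benchmark": "terasort", "datanodes": 8, "io_file_buffer": 131072, "time": 500.0},
    {"benchmark": "wordcount", "datanodes": 4, "io_file_buffer": 65536, "time": 750.0},
    {"benchmark": "wordcount", "datanodes": 4, "io_file_buffer": 131072, "time": 350.0},
    {"benchmark": "wordcount", "datanodes": 8, "io_file_buffer": 65536, "time": 700.0},
    {"benchmark": "wordcount", "datanodes": 8, "io_file_buffer": 131072, "time": 300.0},
]


def variance_reduction(runs, factor, target="time"):
    """Drop in the target's variance when runs are grouped by `factor`."""
    total = pvariance([r[target] for r in runs])
    groups = defaultdict(list)
    for r in runs:
        groups[r[factor]].append(r[target])
    # Weighted average of the variance remaining inside each group.
    within = sum(len(g) * pvariance(g) for g in groups.values()) / len(runs)
    return total - within


def rank_factors(runs, factors):
    """Factors ordered from most to least relevant."""
    return sorted(factors, key=lambda f: variance_reduction(runs, f), reverse=True)


ranking = rank_factors(RUNS, ["datanodes", "benchmark", "io_file_buffer"])
print(ranking)
```

A full tree would repeat the same scoring inside each group, which is how a different factor can become the most relevant one further down a branch.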

Page 19: Benefits of Hadoop as Platform as a Service

Relevant factors tree

Resulting tree for PaaS executions

[Figure: relevant factors tree. Extracted node labels: IOFileBuffer=131072, Datasize; Benchmark=Terasort, Replication; Benchmark=wordcount, Datanodes; IOFileBuffer=262144, Datasize]

Page 20: Benefits of Hadoop as Platform as a Service

Relevant factors tree

Provider is not a relevant factor

Page 21: Benefits of Hadoop as Platform as a Service

Relevant factors tree

But data size changes which factor is the next most important

Page 22: Benefits of Hadoop as Platform as a Service

IO File Buffer 10GB

Analysing IO File Buffer (most relevant parameter in the tree)

Page 23: Benefits of Hadoop as Platform as a Service

IO File Buffer 100GB

Page 24: Benefits of Hadoop as Platform as a Service

IO File Buffer 1TB

Whether to use one or the other depends on your application

Page 25: Benefits of Hadoop as Platform as a Service

Replication factor 100GB

Page 26: Benefits of Hadoop as Platform as a Service

Replication factor 1TB

Important, but does not make a significant difference

Page 27: Benefits of Hadoop as Platform as a Service

Datasize scalability terasort

[Figure panel: 4 cores, 15 GB]

Page 28: Benefits of Hadoop as Platform as a Service

Datasize scalability terasort

[Figure panels: 4 cores, 7 GB; 8 cores, 14 GB; 4 cores, 15 GB; 8 cores, 30 GB]

Page 29: Benefits of Hadoop as Platform as a Service

Datanodes impact, wordcount

[Figure panels: 4 cores, 15 GB; 8 cores, 30 GB]

Page 30: Benefits of Hadoop as Platform as a Service

Datanodes impact, terasort

[Figure panels: 4 cores, 7 GB; 8 cores, 14 GB; 4 cores, 15 GB; 8 cores, 30 GB]

Page 31: Benefits of Hadoop as Platform as a Service

Datanodes impact, terasort

Diminishing returns

[Figure panels: 8 cores, 14 GB; 4 cores, 15 GB. Annotated values: $2.87, $2.88]

Page 32: Benefits of Hadoop as Platform as a Service

Cost difference IaaS and PaaS

Provider         VM size              IaaS US$/h    PaaS US$/h
Azure/HDI        4 CPU, 7 GB RAM      $0.176        $0.32
                 8 CPU, 15 GB RAM     $0.352        $0.64
Rackspace/CBD    4 vCPU, 15 GB RAM    $0.555        $0.7925
                 8 vCPU, 30 GB RAM    $1.11         $2.776
Amazon/EMR       4 vCPU, 16 GB RAM    $0.239        $0.299
                 8 vCPU, 32 GB RAM    $0.479        $0.599

IaaS is cheaper, but might increase TCO (maintenance on your own!)
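
To see how large the gap is, the hourly prices listed on this slide can be turned into a relative PaaS premium. This is a small sketch; the dictionary layout and the function name are illustrative.

```python
# Hourly prices (US$/h) as listed on the slide: (IaaS, PaaS).
PRICES = {
    ("Azure/HDI", "4 CPU, 7 GB"): (0.176, 0.32),
    ("Azure/HDI", "8 CPU, 15 GB"): (0.352, 0.64),
    ("Rackspace/CBD", "4 vCPU, 15 GB"): (0.555, 0.7925),
    ("Rackspace/CBD", "8 vCPU, 30 GB"): (1.11, 2.776),
    ("Amazon/EMR", "4 vCPU, 16 GB"): (0.239, 0.299),
    ("Amazon/EMR", "8 vCPU, 32 GB"): (0.479, 0.599),
}


def paas_premium_pct(iaas: float, paas: float) -> float:
    """Extra cost of PaaS over IaaS, as a percentage of the IaaS price."""
    return 100.0 * (paas - iaas) / iaas


for (provider, size), (iaas, paas) in PRICES.items():
    print(f"{provider:14} {size:14} +{paas_premium_pct(iaas, paas):.1f}%")
```

The premium is the hourly price of the managed service; whether it pays off depends on the maintenance effort it replaces.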

Page 33: Benefits of Hadoop as Platform as a Service

Conclusions

The choice of provider is not really significant

In the public cloud, large data sizes or large clusters introduce problems

• A larger cluster may improve performance but be more expensive in the end
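
A rough sketch of why a faster cluster can still cost more per run. The hourly price is the Azure/HDI 4 CPU PaaS price from the cost slide; the runtimes are hypothetical, chosen only to show sub-linear speed-up, and head nodes are ignored.

```python
def cost_per_run(nodes: int, price_per_node_hour: float, runtime_s: float) -> float:
    """US$ spent on one run: nodes * hourly price * hours the run takes."""
    return nodes * price_per_node_hour * runtime_s / 3600.0


PRICE = 0.32  # US$/h per node (Azure/HDI 4 CPU PaaS price from the cost slide)

# Hypothetical runtimes (seconds) of the same job on growing clusters.
runtimes = {4: 3600.0, 8: 2200.0, 16: 1500.0}

costs = {n: cost_per_run(n, PRICE, t) for n, t in runtimes.items()}
for n in runtimes:
    print(f"{n:2d} datanodes: {runtimes[n]:6.0f} s, {costs[n]:.2f} US$ per run")
```

Unless doubling the nodes at least halves the runtime, each run gets more expensive as the cluster grows.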

PaaS allows you to save on maintenance

• But you still have to take care of tuning a bit

• Not as much as on IaaS

• Whether it is cheaper than IaaS depends on your business

Page 34: Benefits of Hadoop as Platform as a Service

Thank you!

For further information please contact

[email protected]

www.bsc.es