Scaling Big Data with Hadoop and Mesos
Scaling Big Data with Hadoop and Mesos
Bernardo Gomez Palacio
Software Engineer at Guavus Inc
Beyond Buzz Words
Mesos and Data Analysis
Yes, you don't need Hadoop to start using Mesos and Spark.
Now, If You...
- Need to store large files? By default each HDFS block is 128 MB.
- Write data mainly as new files, or by appending to existing ones?
Convinced You Want to Jump on the Hadoop Bandwagon?
Read: Sammer, Eric. "Hadoop Operations." Sebastopol, CA: O'Reilly, 2012. Print.
Welcome to the Jungle
Version Hell
Distributions
Apache Bigtop, CDH, HDP, MapR
Hadoop
- HDFS
- MRv1
- MRv2
Assuming You Already Have Mesos
- Mesosphere packages: https://mesosphere.io/downloads/
- From source: https://github.com/apache/mesos
Hadoop MRv1 in Mesos
https://github.com/mesos/hadoop
Hadoop MRv1 in Mesos
- Requires Hadoop MRv1.
- Officially works with CDH5 MRv1.
- Works with Apache Hadoop 0.22, 0.23, and 1+.
- Apache Hadoop 2+ doesn't come with MRv1!
Hadoop MRv1 in Mesos
- Requires a JobTracker.
- By default it uses org.apache.hadoop.mapred.JobQueueTaskScheduler.
- You can change it, e.g. to org.apache.hadoop.mapred.FairScheduler (see the sketch below).
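A minimal mapred-site.xml sketch of that swap, assuming the github.com/mesos/hadoop framework: mapred.jobtracker.taskScheduler is the standard MRv1 property and points the JobTracker at the Mesos-aware scheduler, while mapred.mesos.taskScheduler (a property name taken from that framework's README, so verify it against your version) names the scheduler it wraps, here the FairScheduler instead of the default JobQueueTaskScheduler.

<!-- mapred-site.xml (sketch) -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.MesosScheduler</value>
</property>
<property>
  <name>mapred.mesos.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>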
Hadoop MRv1 in Mesos
- Requires a TaskTracker.
- That is org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker,
- and not org.apache.hadoop.mapred.TaskTracker.
How Does Hadoop MRv1 Run on Mesos?
How Hadoop MRv1 on Mesos Works
1. The framework's Mesos scheduler creates the JobTracker as part of the driver.
2. The JobTracker then uses org.apache.hadoop.mapred.MesosScheduler to launch tasks.
Mesos Hadoop Task Scheduling
- mapred.mesos.slot.cpus (default: 1)
- mapred.mesos.slot.disk (default: 1024 MB)
- mapred.mesos.slot.mem (default: 1024 MB)
Additional Mesos Parameters
- mapred.mesos.checkpoint (default: false)
- mapred.mesos.role (default: *). See the combined sketch below.
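The same mapred-site.xml can carry the slot sizing and the extra parameters; a sketch using the default values listed above (tune them per cluster):

<!-- mapred-site.xml (sketch, default values) -->
<property>
  <name>mapred.mesos.slot.cpus</name>
  <value>1</value>
</property>
<property>
  <name>mapred.mesos.slot.disk</name>
  <value>1024</value> <!-- MB -->
</property>
<property>
  <name>mapred.mesos.slot.mem</name>
  <value>1024</value> <!-- MB -->
</property>
<property>
  <name>mapred.mesos.checkpoint</name>
  <value>false</value>
</property>
<property>
  <name>mapred.mesos.role</name>
  <value>*</value>
</property>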
Thoughts
- What about Hadoop 2.4?
- NameNode HA?
- MRv2 and YARN?
Personal Preference
- Use Hadoop 2.4.0 or above.
- NameNode HA through the Quorum Journal Manager.
- Move to Spark if possible.
Example of a Mesos Data Analysis Stack
1. HDFS stores files.
2. Use the Spark CLI to test ideas.
3. Use Spark Submit for jobs.
4. Use Chronos or Oozie to schedule workflows.
Spark on Mesos
Figure: Spark cluster overview (https://spark.apache.org/docs/latest/img/cluster-overview.png)
Know That Each Spark Application
1. Has its own driver process.
2. Has its own RDDs.
3. Has its own cache.
Spark Schedulers on Mesos
- Fine-grained
- Coarse-grained
Spark Fine-Grained Scheduling
- Enabled by default.
- Each Spark task runs as a separate Mesos task.
- Has an overhead in launching each task.
Spark Coarse-Grained Scheduling
- Uses only one long-running Spark task on each Mesos slave.
- Dynamically schedules its own “mini-tasks” within it, using Akka.
- Lower startup overhead.
- Reserves the cluster resources for the complete duration of the application.
Beware Of...
- Greedy scheduling (coarse-grained); see the configuration sketch below.
- Overcommitting and deadlocks (fine-grained).
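A minimal spark-defaults.conf style sketch for picking the mode and containing the greed: spark.mesos.coarse switches to coarse-grained scheduling, and spark.cores.max caps how many cores a coarse-grained application will claim. The numbers are only illustrative.

spark.mesos.coarse      true
spark.cores.max         24
spark.executor.memory   4g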
Using Spark
Understand Parametrization and Usage
- spark.app.name
- spark.executor.memory
- spark.serializer
- spark.local.dir
- ...
Use Spark Submit
Avoid parametrizing the SparkContext in your code as much as possible.
Leverage spark-submit arguments, properties files, and environment variables to configure your application, as sketched below.
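A hedged spark-submit sketch along those lines; the class name, master URL, file paths, and jar name are placeholders for illustration.

spark-submit \
  --name "my-analysis-job" \
  --class com.example.MyAnalysisJob \
  --master mesos://zk://zk1:2181,zk2:2181,zk3:2181/mesos \
  --properties-file conf/my-job.conf \
  --conf spark.executor.memory=4g \
  my-analysis-job-assembly.jar hdfs:///data/in hdfs:///data/out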
Using Spark
Accept That Tuning Is a Science & an Art
Understand and Tune Your Applications
- Know your working set.
- Understand Spark partitioning and block management.
- Define your Spark workflow and where to cache/persist.
- If you cache, you will serialize; use Kryo (sketch below).
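A minimal Scala sketch of the cache-with-Kryo point; the app name, input path, and parsing step are hypothetical stand-ins. In practice you would likely set spark.serializer through spark-submit rather than in code.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Kryo matters most when cached data is stored in serialized form.
val conf = new SparkConf()
  .setAppName("cache-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val events = sc.textFile("hdfs:///data/events")  // hypothetical input path
val parsed = events.map(_.split("\t"))           // stand-in parsing step
parsed.persist(StorageLevel.MEMORY_ONLY_SER)     // cached blocks are serialized, via Kryo
println(parsed.count())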
Example Spark API: PairRDDFunctions

def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    numPartitions: Int): RDD[(K, C)]
PairRDDFunctions.combineByKey
- Combines the elements for each key using a custom set of aggregation functions.
- Turns an RDD[(K, V)] into an RDD[(K, C)].
PairRDDFunctions.combineByKey
- createCombiner: turns a V into a C.
- mergeValue: merges a V into a C.
- mergeCombiners: combines two C's into a single one.
- The partitioner defaults to HashPartitioner (usage sketch below).
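A small Scala sketch of those three functions in action, computing a per-key average; here V = Double and C = (sum, count), the sample data is made up, and it is runnable in the Spark shell where sc is predefined.

val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val sumCounts = scores.combineByKey(
  (v: Double) => (v, 1),                                             // createCombiner: V => C
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),       // mergeValue: (C, V) => C
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2) // mergeCombiners: (C, C) => C
)
val averages = sumCounts.mapValues { case (sum, count) => sum / count }
averages.collect()  // Array((a,2.0), (b,2.0))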
Example Spark API: PairRDDFunctions
self: RDD[(K, V)]

def aggregateByKey[U: ClassTag](zeroValue: U)(
    seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)]

Uses the default partitioner (example below).
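The same per-key average written with aggregateByKey; the zero value plays the role of an empty (sum, count) accumulator, the sample data is again made up, and sc is assumed from the Spark shell.

val scores = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)))
val sumCounts = scores.aggregateByKey((0.0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into the accumulator
  (a, b)   => (a._1 + b._1, a._2 + b._2)  // combOp: merge accumulators across partitions
)
val averages = sumCounts.mapValues { case (sum, count) => sum / count }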
Understand your Data
Tune your Data
- For each data source, understand its optimal block size.
- Leverage Avro as the serialization format.
- Leverage Parquet as the storage format.
- Try to keep your Avro & Parquet schemas flat.
Suggestions
Each Application
- Instrument the code.
- Measure input size in number of records and in bytes.
- Measure output size in the same way (sketch below).
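A minimal Scala sketch of that kind of measurement, runnable in the Spark shell; the paths and the transformation are placeholders, and note that counting adds extra passes over the data.

val input = sc.textFile("hdfs:///data/in")  // hypothetical input path
val inputRecords = input.count()            // input size in records
val inputBytes = input.map(_.getBytes("UTF-8").length.toLong).fold(0L)(_ + _)  // input size in bytes

val output = input.filter(_.nonEmpty)       // stand-in for the real transformation
val outputRecords = output.count()          // output size in records
println(s"input: $inputRecords records, $inputBytes bytes; output: $outputRecords records")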
Standardize
- The JDK & JRE version across your cluster.
- The Spark version across your cluster.
- The libraries that will be added to the JVM classpath by default.
- A packaging strategy for your application, e.g. an uber jar.
About YARN and Spark
Some Differences with YARN
- Execution: cluster vs. client modes.
- Isolation: process-based vs. cgroups.
- Docker support? LXC templates?
- Deployment complexity?
Wrapping Up
Some Ideas..
References1. "Hadoop - Apache Hadoop 2.4.0." Apache Hadoop
2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 24 July 2014. link.
2. "Hadoop Distributed File System-2.4.0 - HDFS High Availability Using the Quorum Journal Manager." Apache Hadoop 2.4.0. Apache Software Foundation, 31 Mar. 2014. Web. 23 July 2014. link.
References1. Sammer, Eric. Hadoop Operations. Sebastopol, CA:
O'Reilly, 2012. Print.
2. "Spark Configuration." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014. link.
3. "Tuning Spark." Spark 1.0.1 Documentation. Apache Software Foundation, n.d. Web. 24 July 2014. link.
References1. Ryza, Sandy. "Managing Multiple Resources in
Hadoop 2 with YARN." Cloudera Developer Blog. Cloudera, 2 Dec. 2013. Web. 24 July 2014. link.
Thank you! ✌