Hadoop Spark and Scala Online Training


Transcript of Hadoop Spark and Scala Online Training

Page 1: Hadoop Spark and Scala Online Training

Apache Spark

● What is it?

● How does it work?

● Benefits

● Tuning

● Examples

www.xoomtrainings.com [email protected]

Page 2: Hadoop Spark and Scala Online Training

Spark – What is it?

● Open Source

● Alternative to MapReduce for certain applications

● A low latency cluster computing system

● For very large data sets

● May be 100 times faster than MapReduce for

– Iterative algorithms

– Interactive data mining

● Used with Hadoop / HDFS

● Released under BSD License
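The speed-up on iterative algorithms comes from keeping the working set in memory between passes. A minimal sketch in plain Scala (no Spark; `loadFromDisk` is a made-up stand-in for reading a large input) shows the pattern: the data is loaded once and every iteration reuses it, whereas a MapReduce-style job would re-read its input on each pass.

```scala
// Plain-Scala sketch of why caching helps iterative algorithms.
// loadFromDisk() is a hypothetical stand-in for an expensive read.
object IterativeSketch {
  var loads = 0 // counts how many times the "expensive" load ran

  def loadFromDisk(): Seq[Double] = {
    loads += 1
    Seq(1.0, 2.0, 3.0, 4.0)
  }

  // Iteratively refine an estimate of the mean over the same data.
  def run(iterations: Int): Double = {
    val data = loadFromDisk() // loaded once, kept in memory (like rdd.cache())
    var estimate = 0.0
    for (_ <- 1 to iterations)
      estimate += 0.5 * (data.sum / data.size - estimate)
    estimate
  }

  def main(args: Array[String]): Unit =
    println(s"estimate = ${run(10)}, loads = $loads") // loads stays at 1
}
```

Ten passes over the data, one load: with Spark, `cache()` gives an RDD the same read-once behaviour across a cluster.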


Page 3: Hadoop Spark and Scala Online Training

Spark – How does it work?

● Uses in-memory cluster computing

● Memory access is faster than disk access

● Has APIs written in

– Scala

– Java

– Python

● Can be accessed from Scala and Python shells

● Currently an Apache incubator project


Page 4: Hadoop Spark and Scala Online Training

Spark – Benefits

● Scales to very large clusters

● Uses in-memory processing for increased speed

● High-level APIs

– Java, Scala, Python

● Low latency shell access


Page 5: Hadoop Spark and Scala Online Training

Spark – Tuning

● Bottlenecks can occur in the cluster via

– CPU, memory, or network bandwidth

● Tune the data serialization method, e.g.

– Java ObjectOutputStream vs Kryo

● Memory tuning

– Use primitive types

– Set JVM flags

– Store objects in serialized form, e.g. RDD persistence with the MEMORY_ONLY_SER storage level
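As a sketch of how these knobs are actually set: in Apache Spark the serializer and executor JVM flags are configuration properties, and serialized in-memory storage is chosen per RDD via a storage level. The property names below come from the tuning guide of later Apache Spark releases and may differ in the early version this deck describes.

```
# spark-defaults.conf style sketch (property names from later Apache Spark)
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# JVM flags for the executors, e.g. to watch GC activity while tuning memory
spark.executor.extraJavaOptions  -verbose:gc -XX:+PrintGCDetails
```

In code, serialized persistence is requested per RDD, e.g. rdd.persist(StorageLevel.MEMORY_ONLY_SER) instead of plain rdd.cache().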


Page 6: Hadoop Spark and Scala Online Training

Spark – Examples

● Example from spark-project.org: a Spark job written in Scala

● Shows a simple text count over a system log

/*** SimpleJob.scala ***/
import spark.SparkContext
import SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    val logFile = "/var/log/syslog" // Should be some file on your system
    val sc = new SparkContext("local", "Simple Job", "$YOUR_SPARK_HOME",
      List("target/scala-2.9.3/simple-project_2.9.3-1.0.jar"))
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
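The RDD calls in the job above deliberately mirror Scala's ordinary collection API, so the same filter/count logic can be tried on a plain List first. A sketch with a few invented log lines standing in for /var/log/syslog:

```scala
// Same filter/count pattern as SimpleJob, on an in-memory Scala List
// (no Spark needed; the sample lines are invented for illustration).
object SimpleJobLocal {
  def main(args: Array[String]): Unit = {
    val logData = List("daemon started", "error: bad config", "service ok")
    val numAs = logData.filter(line => line.contains("a")).size
    val numBs = logData.filter(line => line.contains("b")).size
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    // prints: Lines with a: 2, Lines with b: 1
  }
}
```

Moving from this to the cluster version is mostly a matter of swapping the List for sc.textFile(...) and letting Spark distribute the work.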


Page 7: Hadoop Spark and Scala Online Training

Contact Us

● Feel free to contact us at

– www.xoomtrainings.com

– [email protected]

– USA: +1-610-686-8077 or India: +91-404-018-3355

● We offer IT project consultancy

● We are happy to hear about your problems

● You pay only for the hours you need to solve your problems